Abstract

In this paper we study grouped variable selection problems by proposing a
specified prior, called the nested spike and slab prior, to model the collective
behavior of regression coefficients. At the group level, the nested spike and slab
prior puts positive mass on the event that the $l_2$-norm of the grouped
coefficients is equal to zero. At the individual level, each coefficient is assumed
to follow a spike and slab prior. We carry out maximum a posteriori estimation for
the model by applying blockwise coordinate descent algorithms to solve an
optimization problem involving an approximate objective modified by
majorization-minimization techniques. Simulation studies show that the proposed
estimator performs relatively well in situations in which the true and redundant
covariates are both covered by the same group. Asymptotic analysis under a
frequentist framework further shows that the $l_2$ estimation error of the proposed
estimator can have a better upper bound if the group that covers the true
covariates does not cover too many redundant covariates. In addition, given that
some regularity conditions hold, the proposed estimator is asymptotically invariant
to group structures, and its model selection consistency can be established without
imposing irrepresentable-type conditions.
1 Introduction

Variable selection has long been an important issue in regression-based statistical
analysis. Recently, many efficient methods have been developed to tackle the problem
when the number of covariates is large. At the same time, much effort has also been
made to understand the statistical properties of these methods. In this paper we
focus on grouped variable selection problems. More specifically, we study variable
selection in the following regression model:

\[
y_i = \sum_{j \in G_1} x_{ij}\beta_j + \sum_{j \in G_2} x_{ij}\beta_j + \cdots
      + \sum_{j \in G_m} x_{ij}\beta_j + \epsilon_i, \tag{1.1}
\]
where $y_i$ is the response variable for subject $i$, $G_k \subset \{1, 2, \dots, p\}$
is the index set associated with the $k$th group, and $\epsilon_i$ is the
corresponding error term following some specified distribution. Throughout the
paper, we focus on non-overlapping cases, i.e. for two index sets $G_k$ and $G_{k'}$
with $k, k' \in \{1, 2, \dots, m\}$, we assume $G_k \cap G_{k'} = \emptyset$ for
$k \neq k'$. Now let $\beta_{G_k}$ denote the regression vector with entries indexed
by $G_k$. Grouped variable selection aims to select covariates groupwise, that is,
the entries in $\beta_{G_k}$ are either all estimated with non-zero values or all
estimated with zero values. In grouped variable selection, one benchmark method for
estimating $\beta = (\beta_{G_1}, \beta_{G_2}, \dots, \beta_{G_m})$ is the group
lasso [32]:

\[
\hat\beta_{gl} = \arg\min_{\beta} \Big\{ \Big\| y - \sum_{k=1}^{m} X_{G_k}\beta_{G_k} \Big\|_2^2
                 + \lambda \sum_{k=1}^{m} w_k \|\beta_{G_k}\|_2 \Big\}, \tag{1.2}
\]

where $X_{G_k}$ is an $n \times |G_k|$ matrix representing the covariates indexed by
$G_k$, $\lambda > 0$ is the tuning parameter, and $w_k$ is a specified weight
corresponding to the $k$th group.

The group lasso estimator (1.2) has several advantages over the lasso in dealing
with the variable selection problem associated with model (1.1). First, since the
$l_2$-norm $\|\beta_{G_k}\|_2$ is not separable in $\beta_{G_k}$, the group lasso
provides a more suitable way for regression coefficient estimation when the
covariates either have meaningful interpretations as a whole [19, 22, 1], or can be
expressed as a group of dummy variables [32], or are represented as linear
combinations of basis functions [1], [18], [14]. In addition, as shown in [13],
[16], given that some regularity conditions hold, the $l_2$ estimation error of
(1.2) can have an order of magnitude similar to or even smaller than that of the
lasso estimator. Moreover, like the lasso, (1.2) can also enjoy model selection
consistency if some irrepresentable-type conditions are satisfied [1], [23], [22].

Note that the group lasso estimator (1.2) is only able to produce
between-group-sparsity, that is, once the $l_2$-norm $\|\beta_{G_k}\|_2$ is
estimated with a non-zero value, all entries in $\beta_{G_k}$ will be estimated with
non-zero values. However, sometimes the pre-specified group structure may not
exactly cover the true covariates. As a result, redundant covariates may be wrongly
selected in the model, along with the true covariates. To correct this, one needs to
consider within-group-sparsity. Friedman et al. [10] proposed the sparse group lasso
estimation by adding an $l_1$ penalty to the objective function stated in (1.2).
Under the sparse group lasso estimation, within-group-sparsity can be reached, since
with the $l_1$ penalty the regression coefficients in the active groups are allowed
to have zero-valued estimates.

In this paper we study the grouped variable selection problem by developing a
specified spike and slab prior [21], called the nested spike and slab prior, to
model the group regression coefficient vector $\beta_{G_k}$. The nested spike and
slab prior assigns positive mass to the events $\{\|\beta_{G_k}\|_2 = 0\}$ and
$\{\|\beta_{G_k}\|_2 \neq 0\}$ to represent the sparsity between the group
coefficient vectors $\beta_{G_1}, \beta_{G_2}, \dots, \beta_{G_m}$. Given that
$\|\beta_{G_k}\|_2 \neq 0$, it further assigns each entry in $\beta_{G_k}$ a spike
and slab prior [21]. Under the nested spike and slab prior, sparsity between groups
and sparsity within a group can be achieved simultaneously with a positive
probability.

We then develop a method to carry out maximum a posteriori (MAP) estimation for
the model. More specifically, we formulate the estimation problem as an optimization
problem in which the objective function is approximated by
majorization-minimization algorithms [15], [30]. We then solve the optimization
problem by proposing blockwise coordinate descent algorithms based on the ideas
developed in [10, 18]. Simulation studies show that the proposed estimator performs
relatively well in situations in which within-group-sparsity is present. However,
its performance may deteriorate if the true covariates are scattered over a large
number of groups that contain many redundant covariates.

Further, we will show that under a frequentist framework, the proposed MAP
estimator can have a better $l_2$ estimation error bound if the number of groups
that cover the true covariates and the numbers of redundant covariates in such
groups are small. In addition, if some regularity conditions on the tuning
parameters hold, the values of the proposed estimates will be asymptotically
invariant to group structures. We will also establish model selection consistency
for the proposed estimator. The result does not require one to impose
irrepresentable-type conditions.

The paper is organized as follows. In Section 3 we develop the nested spike and
slab prior and construct a Bayesian hierarchical model based on the proposed prior.
We then present a method to carry out maximum a posteriori estimation for the
model. In Section 4 we conduct a simulation study to demonstrate finite sample
properties of the proposed estimator. In Section 5 we establish asymptotic results
for the proposed estimator under a frequentist framework. Section 6 contains two
real data examples. Section 7 is the discussion.



2 Notation 

For the $k$th index set $G_k$, we let $q_k$ denote the number of elements in it,
i.e. $q_k = |G_k|$. For the $j$th covariate, we let $k_j$ denote the index of the
group that $j$ belongs to, that is, if $j \in G_{k'}$, then $k_j = k'$. For a
$p$-dimensional vector $b = (b_1, b_2, \dots, b_p)$, we define $(b)_j = b_j$ and let
$b_{G_k}$ be the vector whose entries are those indexed by $G_k$ in $b$. For the
vector $b$, we define the associated $l_1$-norm by $\|b\|_1 = \sum_{j=1}^{p} |b_j|$
and the $l_2$-norm by $\|b\|_2 = (\sum_{j=1}^{p} |b_j|^2)^{1/2}$. We define the sign
function of $z$ by $\mathrm{sign}(z) = 1$ if $z > 0$; $\mathrm{sign}(z) = -1$ if
$z < 0$; $\mathrm{sign}(z) = 0$ if $z = 0$. Finally, we define the soft-thresholding
operator $ST_\lambda$ by

\[
ST_\lambda(z) = \mathrm{sign}(z)(|z| - \lambda)_+ . \tag{2.1}
\]
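As a small illustration of the operator in (2.1), the following sketch applies soft-thresholding entrywise to a vector. The function name `soft_threshold` and the use of NumPy are choices made here for illustration only; they are not part of the paper.

```python
import numpy as np

def soft_threshold(z, lam):
    """Entrywise ST_lam(z) = sign(z) * max(|z| - lam, 0), as in (2.1)."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Entries with |z| <= lam are set to zero; the rest are shrunk toward zero by lam.
print(soft_threshold([-3.0, -0.5, 0.0, 0.5, 3.0], 1.0))
```

Entries whose magnitude does not exceed the threshold are mapped to zero, which is what makes the operator useful for producing sparse estimates.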



3 Nested spike and slab prior 

Since our aim is to jointly select covariates indexed by $G_k$, the information
about whether $\beta_{G_k}$ is a zero vector or not is crucial. In practice, we
assign probability mass to the event $\{\|\beta_{G_k}\|_2 \neq 0\}$ to express our
belief that $\beta_{G_k}$ is not a zero vector. Let $\theta_k$ denote this
probability. With $\theta_k$, we further assume that $\beta_{G_k}$ follows a
distribution which has a density given by

\[
f(\beta_{G_k}) = \theta_k \prod_{j \in G_k} \big\{ \omega_j g(\beta_j)
   + (1 - \omega_j)\, \delta_{(-\xi,\xi)}(\beta_j) \big\}
   + (1 - \theta_k)\, \delta_0(\|\beta_{G_k}\|_2), \tag{3.1}
\]

where $\omega_j \in [0, 1]$, $g(\beta_j)$ is some specified density defined on
$\mathbb{R} \setminus (-\xi, \xi)$, $\delta_{(-\xi,\xi)}(\beta_j)$ is the uniform
density on $(-\xi, \xi)$, and $\delta_0(\|\beta_{G_k}\|_2)$ is the Dirac delta
function centered at the event $\{\|\beta_{G_k}\|_2 = 0\}$. The density (3.1) is
called the nested spike and slab prior, since the joint spike and slab prior
assigned to the entries of $\beta_{G_k}$ at the individual level is wrapped by a
spike and slab prior assigned at the group level. The nested spike and slab prior
(3.1) implies that $\beta_{G_k}$ has probability $\theta_k$ of being a non-zero
vector. In addition, given that $\beta_{G_k}$ is not a zero vector, the entries in
$\beta_{G_k}$ are independently distributed, and each entry has probability
$\omega_j$ of following a distribution with density $g(\beta_j)$ and probability
$1 - \omega_j$ of falling uniformly in the region $(-\xi, \xi)$.

For practical purposes, we introduce two sets of Bernoulli variables
$\gamma = (\gamma_1, \gamma_2, \dots, \gamma_m)$ and
$\alpha = (\alpha_1, \alpha_2, \dots, \alpha_p)$. The former will be used to model
regression coefficients at the group level while the latter will be used to model
regression coefficients at the individual level. Below we reformulate the nested
spike and slab prior (3.1) in terms of $\alpha$ and $\gamma$. For group $k$, we let
$\gamma_k \sim \mathrm{Bernoulli}(\theta_k)$. For $j \in G_k$, we assume
$\alpha_j \mid \gamma_k = 1 \sim \mathrm{Bernoulli}(\omega_j)$. Here $\alpha_j$ is
defined conditional on $\gamma_k = 1$, reflecting the nested structure of (3.1). Now
conditional on $\gamma_k$ and $\alpha_{G_k}$, the density
$f(\beta_{G_k} \mid \gamma_k, \alpha_{G_k})$ has the same form as the nested spike
and slab prior (3.1) with $\theta_k$ replaced by $\gamma_k$ and $\omega_j$ replaced
by $\alpha_j$. Further, it can be shown that the expectation
$E_{\gamma_k, \alpha_{G_k}}[f(\beta_{G_k} \mid \gamma_k, \alpha_{G_k})]$ is the
nested spike and slab prior (3.1). In addition, given that $\gamma_k$ and
$\alpha_{G_k}$ are known, the prior density
$f(\beta_{G_k} \mid \gamma_k, \alpha_{G_k})$ has an equivalent representation:

\[
f(\beta_{G_k} \mid \gamma_k, \alpha_{G_k}) = \Big[ \prod_{j \in G_k}
   g(\beta_j)^{\alpha_j}\, \delta_{(-\xi,\xi)}(\beta_j)^{1-\alpha_j} \Big]^{\gamma_k}
   \delta_0(\|\beta_{G_k}\|_2)^{1-\gamma_k}. \tag{3.2}
\]

Below we will use the augmented form (3.2) to derive the joint posterior density of
$\beta$, $\alpha$ and $\gamma$.
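To make the two-level structure of the prior concrete, the sketch below draws one coefficient vector through the Bernoulli variables $\gamma_k$ and $\alpha_j$. It is only an illustration under stated assumptions: the spike is taken in the limit where it collapses to a point mass at zero, the slab $g$ is taken to be a zero-mean normal density, and the function name and arguments are hypothetical.

```python
import numpy as np

def sample_nested_spike_slab(groups, theta, omega, slab_sd, rng):
    """Draw (beta, alpha, gamma) from the nested prior, spike taken as a point mass at 0.

    groups  : list of NumPy index arrays G_1, ..., G_m (non-overlapping, covering 0..p-1)
    theta   : length-m array, P(gamma_k = 1)
    omega   : length-p array, P(alpha_j = 1 | gamma_k = 1)
    slab_sd : standard deviation of the assumed normal slab
    """
    p = sum(len(g) for g in groups)
    beta = np.zeros(p)
    alpha = np.zeros(p, dtype=bool)
    gamma = rng.random(len(groups)) < theta              # group-level indicators
    for k, g in enumerate(groups):
        if gamma[k]:
            alpha[g] = rng.random(len(g)) < omega[g]     # individual-level indicators
            active = g[alpha[g]]
            beta[active] = rng.normal(0.0, slab_sd, size=active.size)
    return beta, alpha, gamma

rng = np.random.default_rng(0)
groups = [np.arange(0, 5), np.arange(5, 10)]
beta, alpha, gamma = sample_nested_spike_slab(
    groups, theta=np.array([0.5, 0.5]), omega=np.full(10, 0.5), slab_sd=1.0, rng=rng)
print(gamma, beta)
```

A group with $\gamma_k = 0$ contributes an all-zero block, while a group with $\gamma_k = 1$ can still contain zero entries, which is the simultaneous between-group and within-group sparsity the prior is designed to express.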
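3.1 Model

We now turn back to regression model (1.1). With the prior setting given above, we
can construct a hierarchical Bayesian model and carry out inference on the
parameters in (1.1). For practical purposes, we will only focus on the situation in
which the region $(-\xi, \xi)$ is a small region concentrated around 0, that is,
$\xi \to 0$. Under this situation, we can represent (1.1) in terms of the Bernoulli
variables $\alpha$ and $\gamma$ by
$y_i = \sum_{k=1}^{m} \gamma_k (\sum_{j \in G_k} x_{ij}\alpha_j\beta_j) + \epsilon_i$.
Given that there are $n$ subjects, we assume

\[
\begin{aligned}
y_i \mid x_i, \beta, \alpha, \gamma, \sigma^2 &\sim \mathrm{Normal}\Big(
   \sum_{k=1}^{m} \gamma_k \sum_{j \in G_k} x_{ij}\alpha_j\beta_j,\ \sigma^2 \Big),
   && i = 1, 2, \dots, n, \\
\beta_{G_k} \mid \alpha_{G_k}, \gamma_k, \sigma^2, \lambda &\sim \gamma_k \prod_{j \in G_k}
   \big\{ \alpha_j\, \mathrm{Normal}(0, \sigma^2\lambda^{-1})\, I\{\mathbb{R} \setminus (-\xi, \xi)\}
   + (1 - \alpha_j)\, \delta_{(-\xi,\xi)}(\beta_j) \big\} \\
   &\quad + (1 - \gamma_k)\, \delta_0(\|\beta_{G_k}\|_2), && k = 1, 2, \dots, m, \\
\alpha_{G_k} \mid \gamma_k, \omega_{G_k} &\sim \Big[ \prod_{j \in G_k}
   \mathrm{Bernoulli}(\omega_j) \Big]^{\gamma_k} \delta_0(\alpha_{G_k})^{1-\gamma_k},
   && k = 1, 2, \dots, m, \\
\gamma_k \mid \theta_k &\sim \mathrm{Bernoulli}(\theta_k), && k = 1, 2, \dots, m.
\end{aligned} \tag{3.3}
\]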



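Under the hierarchical Bayesian model (3.3), the joint posterior density of $\beta$,
$\alpha$ and $\gamma$ is given by

\[
f(\beta, \alpha, \gamma \mid y, X, \lambda, \sigma^2, \omega, \theta) \propto
f(y \mid X, \beta, \alpha, \gamma, \sigma^2)\,
f(\beta \mid \alpha, \gamma, \sigma^2, \lambda)\,
f(\alpha \mid \gamma, \omega)\, f(\gamma \mid \theta), \tag{3.4}
\]

where $y = (y_1, y_2, \dots, y_n)$, and for notational simplicity, similar
definitions are applied to $\omega$ and $\theta$. With the joint posterior density
(3.4), various methods can be proposed to make inference on the parameters. Here we
adopt the maximum a posteriori (MAP) approach to carry out the parameter estimation.
We define the maximum a posteriori estimator for $\beta$, $\alpha$ and $\gamma$ by

\[
(\hat\beta, \hat\alpha, \hat\gamma) = \arg\min_{\beta, \alpha, \gamma}
   \; -2 \log f(\beta, \alpha, \gamma \mid y, X, \lambda, \sigma^2, \omega, \theta),
\]

where

\[
\begin{aligned}
-2 \log f(\beta, \alpha, \gamma \mid y, X, \lambda, \sigma^2, \omega, \theta)
 ={}& -2 \log f(y \mid X, \beta, \alpha, \gamma, \sigma^2) \\
 & -2 \log\{ f(\beta \mid \alpha, \gamma, \sigma^2, \lambda)\,
      f(\alpha \mid \gamma, \omega)\, f(\gamma \mid \theta) \} \\
 & -2 \log\{\text{normalizing constant}\}.
\end{aligned} \tag{3.5}
\]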
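3.2 Parameter estimation

By definition, we can write $\gamma_k = I\{\|\beta_{G_k}\|_2 \neq 0\}$ and
$\alpha_j = I\{\beta_j \notin (-\xi, \xi) \mid \|\beta_{G_{k_j}}\|_2 \neq 0\}$, where
$k_j$ is the index of the group that $j$ belongs to. With the augmented
representation (3.2), the second term on the right-hand side of (3.5) can be
expressed as

\[
\begin{aligned}
-2 \log\{ f(\beta \mid \alpha, \gamma, \sigma^2, \lambda)\,
          f(\alpha \mid \gamma, \omega)\, f(\gamma \mid \theta) \}
 ={}& \sum_{k=1}^{m} \sum_{j \in G_k} \gamma_k\alpha_j \frac{\lambda}{\sigma^2}\beta_j^2
   + \sum_{k=1}^{m} \sum_{j \in G_k} \gamma_k\alpha_j \log\Big(\frac{2\pi\sigma^2}{\lambda}\Big) \\
 & + 2 \sum_{k=1}^{m} \sum_{j \in G_k} \gamma_k\alpha_j \log\Big(\frac{1-\omega_j}{\omega_j}\Big) \\
 & + 2 \sum_{k=1}^{m} \gamma_k \Big\{ \log\Big(\frac{1-\theta_k}{\theta_k}\Big)
   - \sum_{j \in G_k} \log(1-\omega_j) \Big\} + C,
\end{aligned} \tag{3.6}
\]

where $C = -2\sum_{k=1}^{m} \log(1-\theta_k)$ does not depend on $\beta$, $\alpha$
or $\gamma$. Here we have used the facts that
$(1 - \alpha_j) \log \delta_{(-\xi,\xi)}(\beta_j) = 0$,
$(1 - \gamma_k) \log \delta_0(\|\beta_{G_k}\|_2) = 0$, and
$(1 - \gamma_k) \log \delta_0(\alpha_{G_k}) = 0$ in deriving (3.6).

In addition, given that $\xi \to 0$, we have
$\alpha_j = I\{\beta_j \neq 0 \mid \|\beta_{G_{k_j}}\|_2 \neq 0\}$. Further, by a
direct calculation, we have
$\gamma_{k_j}\alpha_j = I\{\beta_j \neq 0 \cap \|\beta_{G_{k_j}}\|_2 \neq 0\}
 = I\{\|\beta_{G_{k_j}}\|_2 \neq 0 \mid \beta_j \neq 0\}\, I\{\beta_j \neq 0\}$.
Note that the expectation of the indicator
$I\{\|\beta_{G_{k_j}}\|_2 \neq 0 \mid \beta_j \neq 0\}$ is
$P(\|\beta_{G_{k_j}}\|_2 \neq 0 \mid \beta_j \neq 0)$, which is equal to 1 since
$j \in G_{k_j}$ and $\beta_j \neq 0$ implies $\|\beta_{G_{k_j}}\|_2 \neq 0$ almost
surely. This further implies that
$I\{\|\beta_{G_{k_j}}\|_2 \neq 0 \mid \beta_j \neq 0\}$ is equal to 1 almost surely.
Therefore we have

\[
\gamma_{k_j}\alpha_j = I\{\beta_j \neq 0\}. \tag{3.7}
\]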



Now consider the hyperparameters $\lambda$, $\sigma^2$, $\theta$, $\omega$. Since
there is no easy way to determine the values of these hyperparameters, for practical
purposes we impose some constraints on them. We assume $\omega_j = \omega_1$ for all
$j$. Further, we define
$\rho_1 = \sigma^2 \log\{[(2\pi\sigma^2)/\lambda][(1-\omega_1)/\omega_1]^2\}$ and
assume $\rho_1 > 0$. For the fourth term on the right-hand side of (3.6), which
involves the $\theta_k$'s, we adopt the following parametrization. We assume that
all $\gamma_k$'s in the fourth term on the right-hand side of (3.6) have an equal
weight. Given that $\omega_j = \omega_1$ for all $j$, we can choose appropriate
$\theta_k$'s from the interval $[0, 1]$ to make the weights of the $\gamma_k$'s the
same for all $k$. Let $\theta_k^*$ be such an appropriate value of $\theta_k$. With
the values of the $\theta_k^*$'s, we define
$\rho_2 = \sigma^2 \log\{[(1-\theta_k^*)/\theta_k^*]^{2/\sqrt{q_k}} (1-\omega_1)^{-2\sqrt{q_k}}\}$,
where $q_k = |G_k|$. We assume $\rho_2 > 0$.

With (3.7) and the definitions of $\rho_1$ and $\rho_2$, minimizing (3.5) with
respect to $\beta$, $\alpha$ and $\gamma$ is equivalent to minimizing the function

\[
\Big\| y - \sum_{k=1}^{m} X_{G_k}\beta_{G_k} \Big\|_2^2
 + \lambda \sum_{k=1}^{m} \|\beta_{G_k}\|_2^2
 + \rho_1 \sum_{k=1}^{m} \sum_{j \in G_k} I\{\beta_j \neq 0\}
 + \rho_2 \sum_{k=1}^{m} \sqrt{q_k}\, I\{\|\beta_{G_k}\|_2 \neq 0\} \tag{3.8}
\]

with respect to $\beta$. Here we define the gvsnss estimator (Grouped Variable
Selection via Nested Spike and Slab Priors) as the one that minimizes (3.8). Below we
provide a numerical procedure to calculate the gvsnss estimator.
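For readers who want to see the criterion as a computation, here is a minimal sketch that evaluates the penalized objective for a given coefficient vector: squared error, a ridge term, a count of non-zero coefficients, and a count of non-zero groups. Weighting each non-zero group by $\sqrt{q_k}$ is an assumption made here, chosen to match the $\sqrt{q_k}$ factors that appear in the group weights later in the paper; the function name is hypothetical.

```python
import numpy as np

def gvsnss_objective(y, X, beta, groups, lam, rho1, rho2):
    """Squared error + ridge + rho1 * #nonzero coefficients + rho2 * weighted #nonzero groups."""
    resid = y - X @ beta
    fit = float(resid @ resid)
    ridge = lam * float(beta @ beta)
    n_nonzero = int(np.count_nonzero(beta))
    # Assumption: each non-zero group contributes sqrt(q_k) to the group count.
    group_count = sum(np.sqrt(len(g)) for g in groups if np.any(beta[g] != 0.0))
    return fit + ridge + rho1 * n_nonzero + rho2 * group_count

X = np.eye(4)
y = np.array([1.0, 2.0, 0.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3])]
print(gvsnss_objective(y, X, np.array([1.0, 0.0, 0.0, 0.0]), groups, 0.5, 2.0, 3.0))
```

Because both counting terms are discontinuous in the coefficients, minimizing this function directly requires a search over supports, which motivates the continuous relaxation discussed next.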



3.2.1 Majorization-minimization algorithms 

Since the last two terms in (3.8) are discrete in their domain, the minimization
problem involving (3.8) is combinatorial and in general is considered to be
difficult. Here we adopt a continuous relaxation procedure to modify (3.8). More
specifically, we use the function

\[
g_\tau(a) = \frac{\log(1 + \tau^{-1}|a|)}{\log(1 + \tau^{-1})} \tag{3.9}
\]

to approximate the indicator function $I\{a \neq 0\}$. It can be shown that
$g_\tau(a) \to I\{a \neq 0\}$ as $\tau \to 0$ [26], [31]. Figure 1 shows
$I\{a \neq 0\}$ and $g_\tau(a)$, together with the absolute difference between the
two functions as a function of $-\log\tau$. Since (3.9) is continuous on
$\mathbb{R}$, the combinatorial nature of $I\{a \neq 0\}$ is relaxed. However, (3.9)
is not convex in $a$, and using (3.9) as a continuous relaxation of (3.8) still
leaves (3.8) non-convex. We adopt a majorization-minimization approach to tackle
this problem.
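The behavior of the log-ratio approximation to the indicator can be checked numerically. The sketch below evaluates $\log(1 + \tau^{-1}|a|)/\log(1 + \tau^{-1})$ for a few values of $\tau$; the function name `g_tau` and the chosen grid of values are illustrative only.

```python
import numpy as np

def g_tau(a, tau):
    """Continuous surrogate for the indicator I{a != 0}."""
    a = np.abs(np.asarray(a, dtype=float))
    return np.log1p(a / tau) / np.log1p(1.0 / tau)

a = np.array([0.0, 0.01, 0.5, 1.0, 3.0])
for tau in (1e-1, 1e-4, 1e-8):
    # The surrogate is 0 at a = 0 and equals 1 exactly at |a| = 1 for every tau.
    print(tau, np.round(g_tau(a, tau), 3))
```

For a fixed non-zero $a$ the gap to 1 shrinks roughly like $\log(1/|a|)/\log(1/\tau)$, so the approach to the indicator is slow in $\tau$, and very small values of $\tau$ are needed for small coefficients to be treated as non-zero.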



Majorization-minimization (MM) algorithms [30] aim to solve difficult minimization
problems by modifying the corresponding objective functions so that the solution
spaces of the modified ones are easier to explore. For an objective function
$V^*(a)$, the modification procedure relies on finding a function
$V^{**}(a; a^{(d)})$ that satisfies the following properties:

\[
\begin{aligned}
V^{**}(a; a^{(d)}) &\geq V^*(a) \quad \text{for all } a, \\
V^{**}(a^{(d)}; a^{(d)}) &= V^*(a^{(d)}).
\end{aligned} \tag{3.10}
\]

In (3.10), the objective function $V^*(a)$ is said to be majorized by
$V^{**}(a; a^{(d)})$. In this sense, $V^{**}(a; a^{(d)})$ is called the majorization
function. In addition, (3.10) implies that $V^{**}(a; a^{(d)})$ is tangent to
$V^*(a)$ at $a^{(d)}$. Moreover, if $a^{(d+1)}$ is a minimizer of
$V^{**}(a; a^{(d)})$, then (3.10) further implies that
$V^*(a^{(d)}) = V^{**}(a^{(d)}; a^{(d)}) \geq V^{**}(a^{(d+1)}; a^{(d)}) \geq V^*(a^{(d+1)})$,
which means that the iteration procedure $a^{(d)}$ pushes $V^*(a)$ toward its
minimum.

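Now we turn back to function (3.9). Note that, since $\log(a)$ is a concave function
of $a$ for $a > 0$, the inequality

\[
\log(a') + \frac{a}{a'} - 1 \geq \log(a) \tag{3.11}
\]

holds for all $a > 0$ and $a' > 0$. Note that the left-hand side of (3.11) is convex
in $a$. In addition, if we let $a = a'$, then (3.11) becomes an equality, which
implies that the left-hand side of (3.11) satisfies the properties stated in (3.10)
and is therefore a valid function for majorizing $\log(a)$.

Now by applying (3.9) and the left-hand side of (3.11) to
$\sum_{k=1}^{m} \sum_{j \in G_k} I\{\beta_j \neq 0\}$, we can establish the following
inequality:

\[
\begin{aligned}
\sum_{k=1}^{m} \sum_{j \in G_k} I\{\beta_j \neq 0\}
 &= \lim_{\tau \to 0} \sum_{k=1}^{m} \sum_{j \in G_k}
    \frac{\log(1 + \tau^{-1}|\beta_j|)}{\log(1 + \tau^{-1})} \\
 &\leq \lim_{\tau \to 0} \sum_{k=1}^{m} \sum_{j \in G_k} \frac{1}{\log(1 + \tau^{-1})}
    \Big( \log(1 + \tau^{-1}|\beta_j^{(d)}|)
    + \frac{\tau + |\beta_j|}{\tau + |\beta_j^{(d)}|} - 1 \Big).
\end{aligned} \tag{3.12}
\]

Similarly, for $\sum_{k=1}^{m} \sqrt{q_k}\, I\{\|\beta_{G_k}\|_2 \neq 0\}$, we have

\[
\begin{aligned}
\sum_{k=1}^{m} \sqrt{q_k}\, I\{\|\beta_{G_k}\|_2 \neq 0\}
 &= \lim_{\tau \to 0} \sum_{k=1}^{m} \sqrt{q_k}\,
    \frac{\log(1 + \tau^{-1}\|\beta_{G_k}\|_2)}{\log(1 + \tau^{-1})} \\
 &\leq \lim_{\tau \to 0} \sum_{k=1}^{m} \frac{\sqrt{q_k}}{\log(1 + \tau^{-1})}
    \Big( \log(1 + \tau^{-1}\|\beta_{G_k}^{(d)}\|_2)
    + \frac{\tau + \|\beta_{G_k}\|_2}{\tau + \|\beta_{G_k}^{(d)}\|_2} - 1 \Big).
\end{aligned} \tag{3.13}
\]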
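The majorization step rests on the tangent-line bound for the logarithm. The sketch below checks numerically, at a fixed $\tau$ rather than in the limit, that the linearized function dominates the log-ratio surrogate everywhere and touches it at the expansion point. The function names and the particular values of $\tau$ and the expansion point are illustrative assumptions.

```python
import numpy as np

def g_tau(a, tau):
    """Log-ratio surrogate for I{a != 0} at a fixed tau."""
    return np.log1p(np.abs(a) / tau) / np.log1p(1.0 / tau)

def g_tau_majorizer(a, a_d, tau):
    """Tangent-line majorizer of g_tau built at the expansion point a_d."""
    a, a_d = np.abs(a), abs(a_d)
    return (np.log1p(a_d / tau) + (tau + a) / (tau + a_d) - 1.0) / np.log1p(1.0 / tau)

tau, a_d = 1e-3, 0.7
grid = np.linspace(-3.0, 3.0, 601)
gap = g_tau_majorizer(grid, a_d, tau) - g_tau(grid, tau)
print(gap.min(), g_tau_majorizer(a_d, a_d, tau) - g_tau(a_d, tau))
```

The majorizer is an affine function of $|a|$ with slope proportional to $(\tau + |a^{(d)}|)^{-1}$, which is why the resulting penalty becomes a weighted $l_1$-type term in the iterative scheme.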



3.2.2 Blockwise coordinate descent algorithms 

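With the majorization-minimization results (3.12) and (3.13), we can establish an
iterative scheme to find the minimizer of (3.8). In practice, we use the blockwise
iterative scheme

\[
\beta_{G_k}^{(d+1)} = \arg\min_{\beta_{G_k}} \Big\{ \|r_{-G_k} - X_{G_k}\beta_{G_k}\|_2^2
   + \lambda \|\beta_{G_k}\|_2^2
   + \lambda_1 \|\phi_{G_k}^{(d)} \circ \beta_{G_k}\|_1
   + \lambda_2 \psi_k^{(d)} \|\beta_{G_k}\|_2 \Big\} \tag{3.14}
\]

to find the solution that minimizes (3.8), where
$\lambda_1 = \rho_1 \lim_{\tau \to 0} [\log(1 + \tau^{-1})]^{-1}$,
$\lambda_2 = \rho_2 \lim_{\tau \to 0} [\log(1 + \tau^{-1})]^{-1}$,
$r_{-G_k} = y - \sum_{k' \neq k} X_{G_{k'}}\beta_{G_{k'}}$, and $\circ$ denotes the
entrywise product. In addition, for $j \in G_k$,
$\phi_j^{(d)} = \lim_{\tau \to 0} (\tau + |\beta_j^{(d)}|)^{-1}$ and
$\psi_k^{(d)} = \lim_{\tau \to 0} \sqrt{q_k}\, (\tau + \|\beta_{G_k}^{(d)}\|_2)^{-1}$.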

With the objective function stated in (3.14), one can derive the associated KKT
conditions and solve them for the minimizer $\beta_{G_k}^{(d+1)}$. However, the
third and fourth terms on the right-hand side of (3.14) are not smooth, so special
attention is needed to obtain a gradient-like vector for (3.14). Here we adopt a
subgradient-based approach to tackle this problem. For the idea of subgradients and
related theoretical properties, please see Section B.5 of the cited reference. By
applying the subgradient calculus to the objective function in (3.14) with respect
to $\beta_{G_k}$, we can obtain a gradient-like vector for the objective function.
Then by setting the vector to zero, we obtain the subgradient equations

\[
2X_{G_k}^T r_{-G_k} - 2X_{G_k}^T X_{G_k}\beta_{G_k} - 2\lambda\beta_{G_k}
 - \lambda_1 \phi_{G_k}^{(d)} \circ h_{G_k} - \lambda_2 \psi_k^{(d)} v_{G_k} = 0, \tag{3.15}
\]
where $h_{G_k}$ is a subgradient vector of the $l_1$-norm $\|\beta_{G_k}\|_1$, whose
entries are defined, for $j \in G_k$, by $h_j = 1$ if $\beta_j > 0$;
$h_j = h^* \in [-1, 1]$ if $\beta_j = 0$; and $h_j = -1$ if $\beta_j < 0$. In
addition, $v_{G_k}$ is a subgradient vector of the $l_2$-norm $\|\beta_{G_k}\|_2$
and is defined as

\[
v_{G_k} = \begin{cases}
 \beta_{G_k} / \|\beta_{G_k}\|_2 & \text{if } \|\beta_{G_k}\|_2 \neq 0, \\
 v^*_{G_k} \text{ such that } \|v^*_{G_k}\|_2 \leq 1 & \text{if } \|\beta_{G_k}\|_2 = 0.
\end{cases} \tag{3.16}
\]

Below we adopt a method provided by Friedman et al. [10] to solve the subgradient
equations (3.15). The method uses a testing procedure to identify whether
$\beta_{G_k}$ is a zero vector or not. First note that, if $\beta_{G_k} = 0$, then
the subgradient equations (3.15) become

\[
2X_{G_k}^T r_{-G_k} - \lambda_1 \phi_{G_k}^{(d)} \circ h_{G_k}
 = \lambda_2 \psi_k^{(d)} v_{G_k}. \tag{3.17}
\]

Now by definition (3.16), if $\|\beta_{G_k}\|_2 = 0$, i.e. $\beta_{G_k}$ is a zero
vector, then $\|v_{G_k}\|_2 \leq 1$, and therefore (3.17) implies that

\[
\|2X_{G_k}^T r_{-G_k} - \lambda_1 \phi_{G_k}^{(d)} \circ h_{G_k}\|_2
 \leq \lambda_2 \psi_k^{(d)}. \tag{3.18}
\]

To numerically verify the condition (3.18), we need to know $h_{G_k}$. Friedman et
al. [10] provided a practical way to estimate $h_{G_k}$ by solving the least squares
problem
$\min_{h_{G_k}} \|2X_{G_k}^T r_{-G_k} - \lambda_1 \phi_{G_k}^{(d)} \circ h_{G_k}\|_2^2$
subject to $-1 \leq h_j \leq 1$ for $j \in G_k$. The resulting estimate takes a
soft-thresholding form, and by plugging it into (3.18), one obtains

\[
\|ST_{\lambda_1 \phi_{G_k}^{(d)}}(2X_{G_k}^T r_{-G_k})\|_2 \leq \lambda_2 \psi_k^{(d)}. \tag{3.19}
\]

Note that if condition (3.19) holds, we let $\beta_{G_k}^{(d+1)} = 0$; otherwise we
go further to estimate the entries in $\beta_{G_k}$ with other values.

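Below we describe a numerical procedure for estimating the non-zero entries in
$\beta_{G_k}$. First note that, as shown in [29], the $l_2$-norm
$\|\beta_{G_k}\|_2$ on the right-hand side of (3.14) can be bounded in such a way
that

\[
\|\tilde\beta_{G_k}\|_2 + \frac{1}{2\|\tilde\beta_{G_k}\|_2}
 \big( \|\beta_{G_k}\|_2^2 - \|\tilde\beta_{G_k}\|_2^2 \big) \geq \|\beta_{G_k}\|_2. \tag{3.20}
\]

Here the function on the left-hand side is convex in $\beta_{G_k}$. Now if we let
$\beta_{G_k} = \tilde\beta_{G_k}$, then equality holds between the two sides of
(3.20). Therefore the function on the left-hand side of (3.20) majorizes
$\|\beta_{G_k}\|_2$. With the majorization result (3.20), we construct the following
iterative scheme, in which $d_1$ indexes the outer iteration of (3.14) and $d_2$
indexes the inner iteration:

\[
\beta_{G_k}^{(d_1+1, d_2+1)} = \arg\min_{\beta_{G_k}} \Big\{
   \|r_{-G_k} - X_{G_k}\beta_{G_k}\|_2^2 + \lambda \|\beta_{G_k}\|_2^2
   + \lambda_1 \|\phi_{G_k}^{(d_1)} \circ \beta_{G_k}\|_1
   + \frac{\lambda_2 \psi_k^{(d_1)}}{2\|\beta_{G_k}^{(d_1+1, d_2)}\|_2} \|\beta_{G_k}\|_2^2 \Big\} \tag{3.21}
\]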
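to obtain $\beta_{G_k}^{(d_1+1)}$. The scheme (3.21) can be approximated by the
following iterative least squares procedure:

\[
\beta_{G_k}^{(d_1+1, d_2+1)} = \Big[ X_{G_k}^T X_{G_k} + \Big( \lambda
   + \frac{\lambda_2 \psi_k^{(d_1)}}{2\|\beta_{G_k}^{(d_1+1, d_2)}\|_2} \Big)
   I_{q_k \times q_k} \Big]^{-1}
   ST_{\lambda_1 \phi_{G_k}^{(d_1)}/2}(X_{G_k}^T r_{-G_k}), \tag{3.22}
\]

where $ST_{\lambda_1 \phi_{G_k}^{(d_1)}/2}(X_{G_k}^T r_{-G_k})$ is the
soft-thresholding operator defined in (2.1), applied entrywise. A least squares
result similar to (3.22) can be found in the cited literature. For large-scale
problems, we construct a one-dimensional soft-thresholding scheme to approximate
(3.21). The soft-thresholding scheme is given by

\[
\beta_j^{(d_1+1, d_2+1)} = \frac{ST_{\lambda_1 \phi_j^{(d_1)}/2}(x_j^T r_{-j})}
   {x_j^T x_j + \lambda + \dfrac{\lambda_2 \psi_k^{(d_1)}}{2\|\beta_{G_k}^{(d_1+1, d_2)}\|_2}},
\]

where $r_{-j} = r_{-G_k} - \sum_{j' \neq j,\, j' \in G_k} x_{j'}\beta_{j'}$, with
$\beta_{j'} = \beta_{j'}^{(d_1+1, d_2+1)}$ for $j' < j$ and
$\beta_{j'} = \beta_{j'}^{(d_1+1, d_2)}$ for $j' > j$.
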
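The blockwise procedure of this subsection can be sketched for a single group as follows. This is an illustration under stated assumptions rather than the authors' implementation: $\tau$ is held at a small fixed value instead of being taken to the limit, the coordinate update is the one obtained by minimizing the quadratic surrogate of the group penalty one coordinate at a time (soft-thresholding at half the weighted $l_1$ level), and the function names, the default $\tau$, and the fixed number of inner sweeps are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def update_group(X_g, r_minus, beta_g, lam, lam1, lam2, tau=1e-4, n_inner=50):
    """One blockwise update of a group's coefficients.

    X_g     : n x q_k design block of the group
    r_minus : partial residual, y minus the fit of all other groups
    beta_g  : current coefficients of the group (expansion point of the MM weights)
    """
    q = X_g.shape[1]
    phi = 1.0 / (tau + np.abs(beta_g))                   # individual-level weights
    psi = np.sqrt(q) / (tau + np.linalg.norm(beta_g))    # group-level weight
    # Group-level test: if it holds, the whole block is set to zero.
    if np.linalg.norm(soft_threshold(2.0 * X_g.T @ r_minus, lam1 * phi)) <= lam2 * psi:
        return np.zeros(q)
    b = np.array(beta_g, dtype=float)
    col_ss = np.sum(X_g ** 2, axis=0)                    # x_j^T x_j for each column
    for _ in range(n_inner):
        # Expansion point of the quadratic majorizer of the l2-norm (guarded against 0).
        norm_prev = max(np.linalg.norm(b), 1e-12)
        c = lam + lam2 * psi / (2.0 * norm_prev)
        for j in range(q):
            r_j = r_minus - X_g @ b + X_g[:, j] * b[j]   # residual without coordinate j
            b[j] = soft_threshold(X_g[:, j] @ r_j, 0.5 * lam1 * phi[j]) / (col_ss[j] + c)
    return b

X_g = np.eye(3)
print(update_group(X_g, np.array([3.0, -2.0, 1.0]), np.ones(3), lam=0.1, lam1=0.05, lam2=0.05))
```

A full fit would cycle this update over all groups, refreshing the partial residual after each block. Note that a group whose current coefficients are exactly zero receives very large weights, so in this sketch blocks that reach zero tend to stay there.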

3.3 Determining tuning parameter values 

For $\lambda_1$, $\lambda_2$ and $\lambda$, we adopt a grid search strategy to find
their optimal values. Here we assume that each column of the design matrix $X$ is
standardized. To find the optimal $\lambda_1$, we search along a grid of candidate
values in the interval $[0, \lambda_1^{\max}]$, where $\lambda_1^{\max}$ is defined
as $\lambda_1^{\max} = 2.05\tau \times \max_{j \in \{1,2,\dots,p\}} |x_j^T y|$. To
find the optimal $\lambda_2$, we search along a grid of candidate values in the
interval $[0, \lambda_2^{\max}]$, where
$\lambda_2^{\max} = 1.1\tau \times \max_{k \in \{1,2,\dots,m\}} \|2X_{G_k}^T y\|_2 / \sqrt{q_k}$.
For $\lambda$, we assume it decreases with the sample size $n$ and is proportional
to $\lambda_2$. More specifically, we let $\lambda = \lambda_2/(10n)$. With this
reparametrization of $\lambda$, we only need to do grid searches for $\lambda_1$ and
$\lambda_2$. For parameter $\tau$, we let $\tau$ = 5 x 10~^.
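A sketch of how the search intervals described above could be turned into candidate grids is given below. The value of $\tau$ is left as an argument, the equally spaced grid and the default of 20 candidates per parameter are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def tuning_grids(X, y, groups, tau, n_grid=20):
    """Candidate grids on [0, lam1_max] and [0, lam2_max] for a column-standardized X."""
    lam1_max = 2.05 * tau * np.max(np.abs(X.T @ y))
    lam2_max = 1.1 * tau * max(
        np.linalg.norm(2.0 * X[:, g].T @ y) / np.sqrt(len(g)) for g in groups)
    # Assumption: equally spaced candidates; the spacing is not specified in the text.
    return np.linspace(0.0, lam1_max, n_grid), np.linspace(0.0, lam2_max, n_grid)

def ridge_from_lam2(lam2, n):
    """Ridge parameter tied to lam2 through lam = lam2 / (10 n)."""
    return lam2 / (10.0 * n)

X = np.eye(2)
y = np.array([1.0, -2.0])
groups = [np.array([0]), np.array([1])]
lam1_grid, lam2_grid = tuning_grids(X, y, groups, tau=0.5, n_grid=5)
print(lam1_grid, lam2_grid, ridge_from_lam2(lam2_grid[-1], n=2))
```

Tying the ridge parameter to the group-level parameter reduces the search from three dimensions to two, at the cost of not tuning the ridge term separately.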



3.4 Connection with other approaches 

Recent research on variable selection using maximum a posteriori estimation includes
[12, 31]. Armagan et al. [1] developed a shrinkage-based method for variable
selection based on generalized double Pareto priors. The idea of using spike and
slab priors in grouped variable selection has also been adopted by Scheipl et al.
[25], who developed an MCMC-based approach to carry out posterior inference on
additive regression models. The idea of using (3.9) to approximate an indicator
function has been mentioned in [26, 18, 31]. Tipping [28] has pointed out a
connection between the log function (3.9) and the improper Student's t density.



4 Simulation study 

In this section we study finite sample properties of the gvsnss estimator by fitting
regression models with simulated data. In the simulation study, we assume the
covariates are randomly divided into $m$ groups, and the true covariates, i.e. the
covariates with non-zero coefficients, are covered by $r < m$ groups. We will focus
on the following two situations:

i. The true covariates are covered by the $r$ groups, but at the same time, some
redundant covariates, i.e. the covariates with zero coefficients, are also covered
by the $r$ groups.

ii. The true covariates are re-assigned different group labels. In this situation,
$r$, the number of groups that cover the true covariates, will change.

To create the first situation, we focus on varying the level of sparsity in the
groups that contain the true covariates. To create the second situation, we focus on
re-assigning covariates to other groups according to some group switching
probabilities. Under the two situations, each simulation experiment is characterized
by the pair (spr, mis-labeled), where "spr" denotes the level of
within-group-sparsity and "mis-labeled" denotes the group switching probability. For
a covariate in an active group, spr = 0.3 means that its coefficient is coerced to
zero with probability 0.3, and mis-labeled = 0.3 means that it is re-assigned a
different group label with probability 0.3.

Below we introduce the basic simulation scheme. For the $n \times p$ design matrix
$X$, we generate its rows i.i.d. from $\mathrm{MVN}(0, I_{p \times p})$. For the
regression coefficients $\beta = (\beta_1, \beta_2, \dots, \beta_p)$, we first
randomly assign the corresponding covariates to $m$ groups. We then choose $r < m$
groups of covariates and generate their coefficients i.i.d. from Normal(0, 1). We
further set the coefficients of the covariates in the remaining $m - r$ groups to
zero. We then post-process each coefficient by either coercing its value to zero or
re-assigning its covariate a different group label according to the pre-specified
values in (spr, mis-labeled). For the error vector $\epsilon$, we generate its
entries i.i.d. from Normal(0, 1). Finally, we compute the response vector
$y = X\beta + \epsilon$.
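The simulation scheme just described can be sketched as a data generator. Several details are assumptions made here for illustration: the random grouping is balanced (groups of near-equal size), the zeroing and re-labeling steps are applied independently to covariates in the active groups, and the function name and argument names are hypothetical.

```python
import numpy as np

def simulate(n, p, m, r, spr, mislabel, rng):
    """Generate (X, y, beta, labels) following the basic simulation scheme."""
    labels = rng.permutation(np.arange(p) % m)           # random, balanced group labels
    active_groups = rng.choice(m, size=r, replace=False)
    in_active = np.isin(labels, active_groups)
    beta = np.where(in_active, rng.normal(size=p), 0.0)  # Normal(0, 1) in active groups
    # Within-group sparsity: coerce coefficients in active groups to zero w.p. spr.
    beta[in_active & (rng.random(p) < spr)] = 0.0
    # Mis-labeling: move covariates in active groups to a different group w.p. mislabel.
    switch = in_active & (rng.random(p) < mislabel)
    shift = rng.integers(1, m, size=p)                   # shift in 1..m-1 changes the label
    labels = np.where(switch, (labels + shift) % m, labels)
    X = rng.normal(size=(n, p))                          # rows i.i.d. MVN(0, I)
    y = X @ beta + rng.normal(size=n)
    return X, y, beta, labels

X, y, beta, labels = simulate(n=50, p=200, m=10, r=2, spr=0.3, mislabel=0.0,
                              rng=np.random.default_rng(1))
print(X.shape, np.count_nonzero(beta), np.unique(labels[beta != 0]))
```

With a positive mis-labeling probability the non-zero coefficients end up spread over more than $r$ groups, which is the mechanism behind the second situation listed above.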

4.1 Methods for comparisons 

We conducted two gvsnss estimations for the regression model. The first one used
five-fold cross validation for tuning parameter selection. The second one used the
following logarithm of the Bayes factor:

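\[
\log \mathrm{BF}(\hat S, \text{null}) = \frac{1}{2} \log \big| I_{\hat s}
   + \lambda^{-1} X_{\hat S}^T X_{\hat S} \big|^{-1}
   - \frac{n}{2} \log \Big[ \frac{y^T (I_n + \lambda^{-1} X_{\hat S} X_{\hat S}^T)^{-1} y}{y^T y} \Big] \tag{4.1}
\]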

for tuning parameter selection, where $\hat S = \{j : \hat\beta_{gvsnss,j} \neq 0\}$.
The log Bayes factor (4.1) corresponds to the model that assigns
Normal(0, $\sigma^2/\lambda$) to $\beta_j$ and Inverse-Gamma($\tau_1$, $\tau_2$) to
$\sigma^2$, with $\tau_1$ and $\tau_2$ both approaching zero. For tuning parameter
selection, we searched for the optimal $\lambda_1$ along a grid of 20 candidate
values and the optimal $\lambda_2$ along a grid of another 20 candidate values.
We also conducted three other estimations for the regression model. The first one is
the group lasso using five-fold cross validation for tuning parameter selection. The
second one is also the group lasso but using a naive AIC for tuning parameter
selection. The naive AIC is given by
$\mathrm{nAIC} = \|y - X\hat\beta_{gl}\|_2^2 / \hat\sigma^2 + 2 s_{gl}$, where
$\hat\sigma^2$ is estimated from the null model and $s_{gl}$ is the number of
non-zero entries in $\hat\beta_{gl}$. Numerical calculations for the two group lasso
estimations were done using the R package grplasso [19]. The third one is the lasso
using ten-fold cross validation for tuning parameter selection. We used the R
package glmnet to carry out numerical computations for the lasso estimation. For all
three estimations, we searched for optimal tuning parameters along a grid of 100
candidate values.

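We collected three performance measures at each simulation run. The first one is the
sign-adjusted false positive rate, which is defined as

\[
\mathrm{SFPR} = \frac{\#\{j \in \hat S : \mathrm{sign}(\hat\beta_j) \neq \mathrm{sign}(\beta_{true,j})\}}{\#\{j \in \hat S\}}.
\]

The second one is the squared $l_2$ estimation error, which is defined as

\[
l_2\text{-dis} = \sum_{j=1}^{p} (\hat\beta_j - \beta_{true,j})^2 .
\]

The third one is the predictive mean squared error, which is defined as

\[
\mathrm{PMSE} = \frac{1}{n'} \sum_{i=1}^{n'} (y_{i,new} - x_{i,new}^T \hat\beta)^2 ,
\]

where $n' = 10 \times n$, and $y_{i,new}$ and $x_{i,new}$ are new data points
generated under the same simulation scheme.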
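The three performance measures can be computed along the following lines. This is a sketch under assumptions: the sign-adjusted false positive rate is normalized here by the number of selected covariates, which is an assumed denominator, and the function names are hypothetical.

```python
import numpy as np

def sfpr(beta_hat, beta_true):
    """Share of selected covariates whose estimated sign disagrees with the true sign.

    Assumption: the rate is taken relative to the number of selected covariates.
    """
    selected = beta_hat != 0
    if not selected.any():
        return 0.0
    wrong = np.sign(beta_hat[selected]) != np.sign(beta_true[selected])
    return float(wrong.sum() / selected.sum())

def squared_l2_error(beta_hat, beta_true):
    d = beta_hat - beta_true
    return float(d @ d)

def pmse(beta_hat, X_new, y_new):
    """Mean squared prediction error on freshly generated data."""
    resid = y_new - X_new @ beta_hat
    return float(resid @ resid) / len(y_new)

beta_true = np.array([1.0, 0.0, -2.0, 0.0])
beta_hat = np.array([0.5, 0.3, 2.0, 0.0])
print(sfpr(beta_hat, beta_true), squared_l2_error(beta_hat, beta_true))
```

Counting sign disagreements rather than bare selections means that a selected true covariate with the wrong estimated sign is also treated as an error.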



4.2 Results 

In practice, we let $p = 200$, $m = 10$, and $r = 2$. We considered different values
of the sample size $n$ and the pair (spr, mis-labeled) in generating data points.

We first considered the scenario in which the group switching probability is zero.
The results are shown in Figure 2, with the first, second and third rows being the
plots of SFPR, $l_2$-dis and PMSE, respectively, and the first, second and third
columns being the plots for cases with spr = 0, 0.3, 0.6, respectively. Each point
in the plot is an average over 100 simulation runs. The results show that the gvsnss
estimator performs relatively well compared with the group lasso in variable
selection as the level of within-group-sparsity increases. In addition, among the
five estimations, the gvsnss estimation using the Bayes factor has relatively small
values of squared $l_2$ estimation error and PMSE. However, we also noticed that the
advantages of using group-based estimations such as the group lasso or gvsnss
estimations over the lasso estimation gradually disappear as the level of
within-group-sparsity increases.

We then considered scenarios under different group switching probabilities. The
results are given in Figures 3 and 4 for group switching probability equal to 0.1
and 0.5, respectively. The results show that the gvsnss estimator can still perform
relatively well compared with the other benchmark estimation methods in variable
selection. However, we also noticed that the lasso estimation almost dominates the
group-based estimation methods in $l_2$ estimation error and PMSE in these
scenarios, especially when the group switching probability is high. A high group
switching probability will lead to an increase in $r$, the number of groups that
cover the true covariates. In Section 5 we will give a theoretical explanation of
these simulation results by deriving an upper bound for the $l_2$ estimation error.



5 Asymptotic analysis 

In this section we investigate the asymptotic behavior of the gvsnss estimator.
Before presenting these results, we introduce some notation. For simplicity, we
define $\beta = \beta_{true}$ throughout this section. Further define
$S = \{j : \beta_j \neq 0\}$ and $G_R = \{G_k : k \in R\}$, a collection of disjoint
index sets $G_k$ indexed by $R$ that covers $S$, i.e. $S \subseteq G_R$. Define
$s = |S|$, the number of non-zero coefficients, $q_R = |\bigcup_{k \in R} G_k|$, the
number of indices covered by $G_R$, and $r = |R|$, the number of groups that cover
the indices of covariates with non-zero coefficients.
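Now consider the following function:

\[
\begin{aligned}
V_\tau(w', \beta', G') ={}& \|\epsilon' - Xw'\|_2^2 + \lambda \|w' + \beta'\|_2^2
   + \rho_1 \sum_{j=1}^{p} \frac{\log(1 + \tau^{-1}|w'_j + \beta'_j|)}{\log(1 + \tau^{-1})} \\
 & + \rho_2 \sum_{k=1}^{m} \sqrt{q'_k}\,
   \frac{\log(1 + \tau^{-1}\|w'_{G'_k} + \beta'_{G'_k}\|_2)}{\log(1 + \tau^{-1})},
\end{aligned} \tag{5.1}
\]

where $\epsilon' = y - X\beta'$, $G' = \{G'_k : k = 1, 2, \dots, m\}$ and
$q'_k = |G'_k|$. At a fixed $\tau'$, we define $\beta^{\tau'}$ by

\[
\beta^{\tau'} = \arg\min_{\beta'} \lim_{\tau \to \tau'} V_\tau(0, \beta', G). \tag{5.2}
\]

Further define $S^{\tau'} = \{j : \beta_j^{\tau'} \neq 0\}$ and
$s^{\tau'} = |S^{\tau'}|$. Note that if we let $\tau \to 0$, then
$V_\tau(0, \beta', G')$ will approach the objective function (3.8). Therefore
technically we can express the gvsnss estimator as

\[
\hat\beta_{gvsnss} = \arg\min_{\beta'} \lim_{\tau \to 0} V_\tau(0, \beta', G). \tag{5.3}
\]

We further define $\hat S = \{j : \hat\beta_{gvsnss,j} \neq 0\}$ and
$\hat s = |\hat S|$. Note that by definition, as $\tau' \to 0$, (5.2) becomes
$\beta^{0} = \arg\min_{\beta'} \lim_{\tau \to 0} V_\tau(0, \beta', G) = \hat\beta_{gvsnss}$.
As a result, we have $S^{\tau'} \to \hat S$ and $s^{\tau'} \to \hat s$ as
$\tau' \to 0$.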
5.1 $l_2$ estimation error

One useful concept for justifying the advantage of group-based estimation is strong
group sparsity. We say the true coefficient vector $\beta$ is $(s_0, r_0)$ strongly
group-sparse if there exists a collection of index sets $G_R = \{G_k : k \in R\}$
such that $S \subseteq G_R$ with $q_R = |G_R| \leq s_0$ and $r = |R| \leq r_0$. For
the group lasso $\hat\beta_{gl}$ defined in (1.2), Huang and Zhang [13] showed that
if $\beta$ is $(s_0, r_0)$ strongly group-sparse, then given that some regularity
conditions hold, with probability $1 - a$, the $l_2$ estimation error
$\|\hat\beta_{gl} - \beta\|_2 = O(n^{-1/2}\sqrt{s_0 + r_0 \log(m/a)})$. The order of
magnitude implies that the group lasso estimation can be beneficial if $q_R$, the
number of indices in $G_R$, and $r$, the number of index sets that cover $S$, are
small.

Here we have to note that directly comparing rates of the $l_2$ estimation error
between the lasso and group lasso is not easy, since it requires one to derive the
rates under the same assumptions. Lounici et al. [16] provided such comparisons for
multi-task learning cases and showed that the upper bound for the $l_2$ estimation
error of the group lasso can have an order of magnitude smaller than the lower bound
for the $l_2$ estimation error of the lasso.

Below we start our investigation of the $l_2$ estimation error
$\|\hat\beta_{gvsnss} - \beta\|_2$ by deriving a deterministic upper bound for
$\|\beta^{\tau} - \beta\|_2$.

Theorem 5.1. For $\epsilon = y - X\beta$, $\tau \in [0, 1)$, and $1 \leq \max(q_R, s^{\tau}) \leq p$, we have,



< 



1/2 

(1r 



\X 



n 



+ 2 max \ Bj\ — 
jes ' -"n 



+ 



2C2 ^ + 1 
log(r-i) 



Pi + P2 

n 



(5.4) 



where $\kappa_n = n^{-1} \min_{\|w\|_2 = 1} w^T X^T X w$,
$C_2 = \min_{j \in S^{\tau}} |\beta_j^{\tau}|$, and $C_3 = \min_{j \in S} |\beta_j|$.



Theorem 5.1 does not rely on any distributional assumption on the error vector $e$.
It is stated deterministically and has no probabilistic interpretation.

Below we give some conditions that are useful in deriving upper bounds for
$\|\hat{\beta}_{gvsnss} - \beta\|_2$ when a distributional assumption is imposed on $e$.

Assumption 1. Let $\kappa_n$ be the same as the one defined in Theorem 5.1. We
assume $\kappa_n + \lambda n^{-1} > 0$ as $n \to \infty$.

Assumption 1 is similar to Condition A1 in [3J|. It mainly serves to
guarantee that the minimum eigenvalue of the matrix $n^{-1}(X^T X + \lambda I_{p \times p})$ is positive
when $n \to \infty$. Note that without Assumption 1, $\kappa_n$ will be equal to zero when
$n < p < \infty$, but the minimum eigenvalue $\kappa_n + n^{-1}\lambda = n^{-1}\lambda$ will remain
positive if $\lambda > 0$. Assumption 1 further implies that $\sqrt{n}(\kappa_n + \lambda n^{-1}) \to \infty$ when
$n \to \infty$.
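
The remark on $n < p$ can be checked on a toy design. The Python sketch below (the helper `quad_form` and all numbers are illustrative assumptions, not taken from the paper) evaluates $n^{-1} w^T(X^T X + \lambda I)w$ at a unit vector in the null space of $X$: the value is zero without the ridge term and equals $\lambda/n$ with it.

```python
def quad_form(X, w, lam):
    """Return w^T (X^T X + lam * I) w, computed as ||X w||^2 + lam * ||w||^2."""
    Xw = [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) for row in X]
    return sum(v * v for v in Xw) + lam * sum(w_j * w_j for w_j in w)


X = [[1.0, 1.0]]                  # n = 1 observation, p = 2 covariates
n = len(X)
w = [2 ** -0.5, -(2 ** -0.5)]     # unit vector with X w = 0

kappa_term = quad_form(X, w, lam=0.0) / n   # zero: X^T X is singular when n < p
ridge_term = quad_form(X, w, lam=0.5) / n   # approximately lam / n = 0.5
```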



Theorem 5.2. Assume that the $e_i$'s are i.i.d. as Normal$(0, \sigma^2)$. Further assume that

$$n^{-1}\sum_{i=1}^{n} x_{ij}^2 = \zeta_j \ \text{ for } j = 1, 2, \ldots, p, \quad \tau = n^{-1}, \quad \lambda = A_1\psi_n, \quad \rho_1 = A_2\psi_n, \quad \rho_2 = A_3\psi_n,$$

with $A_1$, $A_2$ and $A_3$ being some positive constants, and

$$\psi_n = 2\sigma\sqrt{2n\max_j \zeta_j \left[\log\left(\frac{m}{\alpha}\right) + \log\bar{q}\right]}, \tag{5.5}$$

where $\alpha$ is a non-negative constant and $\bar{q} = m^{-1}\sum_{k=1}^{m} q_k$. Then given that Assumption 1
holds, for $1 \le \max(q_R, \hat{s}) \le p$, with probability $1 - \alpha$, we have



HA 



gvsnss 



< 



^Jn{Kn + fin) 



vn 

log ( — ) +logg 



(5.6) 



as n 00, where 



+(^2 + ^3 



2 max 

iG5 



log(n) =^ , 



2Ala^ 



2 maXj 



log f ^ ) + log q 



(5.7) 
(5.8) 



where $\hat{s} = |\hat{S}|$, and $c_2$ and $c_3$ are defined in Theorem 5.1.



The deterministic result stated in Theorem 5.1 serves as the backbone for deriving the
upper bound (5.6). Note that since we have assumed $\tau = n^{-1}$, effectively
we have $\hat{\beta}^{\tau} \to \hat{\beta}_{gvsnss}$ and $S^{\tau} \to \hat{S}$ as $n \to \infty$. Detailed derivations of Theorem 5.1
and Theorem 5.2 are given in Appendix A.

Note that the bound (5.6) is proportional to $q_R^{1/2}$ and by definition

$$q_R = s + \sum_{k \in R} \#\{j \in G_k : \beta_j = 0\}.$$

Given that $s$ is fixed, the result implies that, if groups that contain the true covariates
also contain large numbers of redundant covariates, or if the true covariates are
scattered over a large number of groups, as in the scenarios with high group switching
probabilities seen in the simulation studies, then the gvsnss estimator will not perform
well.

Now if we adopt an equal group setting, i.e. $q_1 = q_2 = \cdots = q_m$, and let $\zeta_j = 1$
for $j = 1, 2, \ldots, p$, then $q_R = |G_R| = |R| \times |G_k| = rq_1$, and the right hand
side of (5.6) will have an order of magnitude equal to $n^{-1/2}\sqrt{r\log q_1 + r\log(m/\alpha)}$.
Further note that $\log q_1 < q_1$. Therefore with probability $1 - \alpha$, as $n \to \infty$, we have
$\|\hat{\beta}_{gvsnss} - \beta\|_2 = O(n^{-1/2}\sqrt{s_0 + r_0\log(m/\alpha)})$, where $r_0 = r$ and $s_0 = q_R$. The result
given above implies that the gvsnss estimator can achieve an $\ell_2$ estimation error with
an order of magnitude proportional to that of the group lasso established in [13].

The following corollary states that if the maximum size of groups is equal to one,
then the gvsnss estimator can have an $\ell_2$ estimation error with an order of magnitude
similar to that of the lasso established in 20. Isjl.

Corollary 5.1. Assume that $\max_k q_k = 1$ and $\zeta_j = 1$ for $j = 1, 2, \ldots, p$. Then given
that all assumptions stated in Theorem 5.2 hold, with probability $1 - \alpha$, we have



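$$\|\hat{\beta}_{gvsnss} - \beta\|_2 \le \frac{2\sqrt{2}\,\sigma\Lambda_n}{\kappa_n + \mu_n}\sqrt{\frac{s}{n}\log\left(\frac{p}{\alpha}\right)} \tag{5.9}$$

as $n \to \infty$, where $\Lambda_n$ is the same as the one defined in (5.7) and

$$\mu_n = 2A_1\sigma\sqrt{\frac{2}{n}\log\left(\frac{p}{\alpha}\right)}.$$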



Proof of Corollary 5.1. Given that the maximum group size is one, we have
$q_R = s$. In addition, the number of groups is $m = p$. Then by inserting the results
given above into the right hand side of (5.6), we obtain (5.9), which completes the
proof. □



5.2 Label-invariance property

Here we show that the gvsnss estimator (5.3) is asymptotically invariant to group
structures. We consider two collections of index sets $G^* = \{G^*_k : k = 1, 2, \ldots, m^*\}$
and $G^{**} = \{G^{**}_l : l = 1, 2, \ldots, m^{**}\}$. In the following discussion as well as in the proof
we will see $*$ and $**$ attached to various vector-valued quantities; the presence
of $*$ (or $**$) in a given vector means that the entries of the vector are indexed by $G^*_k$
(or $G^{**}_l$) in the original vector.

Our result relies on the fact that the third term in $V_{\tau}(0, \beta', G)$ allows the gvsnss
estimation to produce zero estimates for coefficients whose covariates are in active
groups. Without this setting, we would be unable to establish the label-invariance
property for some cases, and $\hat{\beta}^{*}_{gvsnss} = \arg\min_{\beta'} \lim_{\tau \to 0} V_{\tau}(0, \beta', G^*)$ might never be
a solution to the subgradient equations of $\lim_{\tau \to 0} V_{\tau}(0, \beta', G^{**})$, where $G^{**}$ is an
arbitrary collection of index sets. Therefore we assume $\rho_1 > 0$. In addition, our
result relies on evaluating the difference between the log-sum penalties involving
$\ell_2$-norms in $V_{\tau}(0, \beta', G^*)$ and $V_{\tau}(0, \beta', G^{**})$. Since $\rho_2$ and the size of a group play a
crucial role in the evaluation process, we will also impose an assumption on their
orders of magnitude.

Theorem 5.3. Assume that

$$\hat{\beta}^{\tau *} = \arg\min_{\beta'} V_{\tau}(0, \beta', G^*)$$

is the unique solution to the subgradient equations of $V_{\tau}(0, \beta', G^*)$ for all $\tau \in [0, 1)$.
Further assume that $\rho_1 > 0$, $\rho_2 \max_k \sqrt{q_k} = o(\log n)$, and $\tau = n^{-1}$. Then as $n \to \infty$,
$\hat{\beta}^{*}_{gvsnss} = \arg\min_{\beta'} \lim_{\tau \to 0} V_{\tau}(0, \beta', G^*)$ is the minimizer of $\lim_{\tau \to 0} V_{\tau}(0, \beta', G^{**})$,
where $G^{**}$ is an arbitrary collection of index sets.
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5.3 Variable selection and sign consistency 

Here we study the asymptotic behavior of the gvsnss estimator in variable selection. In
particular, we focus on sign consistency of the estimated coefficients. We first explain
the idea of sign consistency. An estimator $\hat{\beta}(n)$ is said to be sign consistent in
estimating $\beta$ if the probability $P\{\mathrm{sign}(\hat{\beta}(n)) = \mathrm{sign}(\beta)\}$ approaches one as $n \to \infty$.
Given that sign consistency holds, the estimated index set $\hat{S}(n) = \{j : \hat{\beta}_j(n) \ne 0\}$ will
be the same as the true index set $S$. Sign consistency therefore implies variable
selection consistency, that is, asymptotically with probability one, non-zero valued
coefficients will have non-zero estimated values, and zero-valued coefficients will be
estimated with zero values.
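
The relation between the two notions can be illustrated for a single estimate. In the Python sketch below, the helper names and the toy coefficient vectors are hypothetical: matching signs entry by entry forces the estimated index set to coincide with the true one, while a single spurious non-zero estimate breaks both.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signs_match(beta_hat, beta):
    """True if every estimated coefficient has the sign of the true one."""
    return all(sign(a) == sign(b) for a, b in zip(beta_hat, beta))


def index_set(coef):
    """Indices of the non-zero entries."""
    return {j for j, c in enumerate(coef) if c != 0}


beta = [1.5, 0.0, -2.0]
good = [0.3, 0.0, -4.0]   # correct signs, so the selected set equals S
bad = [0.3, 0.1, -4.0]    # a redundant covariate is selected
```

Estimating $P\{\mathrm{sign}(\hat{\beta}(n)) = \mathrm{sign}(\beta)\}$ in a simulation would amount to averaging `signs_match` over replications.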

Below we derive a lower bound for $P\{\mathrm{sign}(\hat{\beta}^{\tau}) = \mathrm{sign}(\beta)\}$. Then with $\tau = n^{-1}$,
we have $\hat{\beta}^{\tau} \to \hat{\beta}_{gvsnss}$ as $n \to \infty$, and in turn, the lower bound for
$P\{\mathrm{sign}(\hat{\beta}_{gvsnss}) = \mathrm{sign}(\beta)\}$ can be established asymptotically. The following
assumptions on eigenvalues of matrices are useful in deriving the lower bound.

Assumption 2. Define $C_{SS} = n^{-1}(X_S^T X_S + \lambda I_{s \times s})$. Define $\kappa_{\min} = \min_{\|w\|_2 = 1} w^T C_{SS} w$.
We assume $0 < \kappa_{\min} < \infty$ as $n \to \infty$.

Assumption 3. Define ^max = max^ n^-'^wX^Xjw. We assume < ^max < oo 
as n — )■ oo. 

Assumption 4. Define z/max,fc = max^, n^^wXc^X^^w and 
For = 1, 2, ■ ■ ■ , m, we assume < t'max.fc < oo as n — )• oo. 

Theorem 5.4. Assume that ei's are i.i.d. as Normal{0^a'^). Further assume that 
^-'Er=i4 = 1 f^^3 = 1,2,-- - r = n~\ X = 0(71^^), pi = Oin^'), p2 = 
0(n^^'^), and p = o(n(log(n + 1))^^). Then given that Assumptions 2, 3 and 4 hold, 
the probability P{sign(/3'^) = sign(/3)} can be bounded from below in a way such that 

P{sign(r) = sign(/3)} 

'V^ L^min _ logg 

2a2 n 

',2 ,^2 



> 1 — exp < — n 



— exp — n 

— exp < — n 



i^2,n'^mm log 



U.35 — 



l.Qn l^max(?^max ~t" ^min) ^ 



(5.10) 



where si = \Si\ with = S'^ H Gr, r*^ = ipi,n, 4'2,n and ip3,n are non- 

negative constants and as n oo, ipi^n = 0(1), ^2,n = 0{n^^'^(logn)~^) and 
^3,n = 0(n3/2(logn)-i). 

The proof can be found in Appendix C. It starts by exploring the KKT
conditions associated with the minimization problem involving objective function (5.1).
Note that in Theorem 5.4 we do not assume that the irrepresentable-type conditions
[33] hold.

Corollary 5.2. Assume that all assumptions and results stated in Theorem 5.4 hold.
Then

$$P\{\mathrm{sign}(\hat{\beta}_{gvsnss}) = \mathrm{sign}(\beta)\} \to 1$$

as $n \to \infty$.

Proof of Corollary \5. 2[ Note that s < p = o{n{log{l+n))^'^) , therefore logs — )■ 
as 72 — i- oo. In addition, ipi^n = 0(1), therefore {2a^)~^ipin^Tain > 0- Then as 
n — )■ oo, the first exponential term in flS.lOp will approach to zero. For the second ex- 
ponential term in (IS.lOp . since %jj2,n = 0(n^/^(log(n))~^), therefore we have n~^-ip2n = 
0(n^(log(n))~^) — )■ oo as n — >■ oo. In addition, si < p = o(n(log(l + n))~^), there- 
fore n~^logs5 — )■ as 72 — )■ oo. Then as n — ?• oo, the second exponential term in 
flS.lOp will approach to zero. Furthermore, since n~^'?/'|„ = 0(n(log?T,)~^) — )■ oo and 
77,^1 log r'^ — !. as n — 7- oo, therefore the third exponential term in f lS.lOp will ap- 
proach to zero as n — 7- oo. Finally note that since r = n~^, therefore — > /3gvsnss 
as — !■ oo. The results given above imply that P{sign(/3gvsnss) = sign(/3)} — )• 1 as 
n — )■ oo, which completes the proof. 



6 Real data examples 

6.1 The U.S. industrial production index

The data set we consider here contains the monthly U.S. industrial production
index and 125 macroeconomic variables, spanning from July 1964 to December
2010. The industrial production index is an important indicator for economic
policy-making. Our aim here is to predict the growth rate of the industrial production index
from the 125 macroeconomic variables. A similar data set was used in HSH-. The
125 macroeconomic variables are essentially a subset of the 132 variables used by
Bai and Ng. For the 125 macroeconomic variables, we follow a benchmark
categorization to divide them into 8 groups: 1) output and income (OI), 2) labor market
(LM), 3) housing (H), 4) consumption, orders and inventories (COI), 5) money and
credits (MC), 6) bond and exchange rates (BE), 7) prices (P), 8) stock market (SM).

Now let $IP_t$ denote the level of the industrial production index at time $t$. We
define the growth rate at time $t + t'$ by $y_{t+t'} = (t')^{-1} 1200[\log(IP_{t+t'}) - \log(IP_t)]$. The
plot in the top left panel of Figure 5 shows the corresponding time series trend. We
further model the growth rate $y_{t+t'}$ by

$$y_{t+t'} = \eta_0 + \sum_{l=0}^{3} z_{t-l}\,\eta_{l+1} + \sum_{k=1}^{8}\sum_{j \in G_k} \beta_j x_{tj} + \varepsilon_{t+t'}, \tag{6.1}$$

where $z_{t-l} = 1200[\log(IP_{t-l}) - \log(IP_{t-l-1})]$ is the $l$th lag term, $x_{tj}$ is the $j$th
macroeconomic variable at time $t$, $G_k$ is the index set corresponding to the $k$th
macroeconomic group, and $\varepsilon_{t+t'}$ is the error term.
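
The two transformations of the index can be written down directly. The Python sketch below follows the definitions of $y_{t+t'}$ and $z_{t-l}$ given above; the function names, the zero-based time index and the synthetic series are assumptions made for illustration only.

```python
import math


def growth_rate(ip, t, horizon):
    """y_{t+t'} = (1200 / t') * [log(IP_{t+t'}) - log(IP_t)], with t' = horizon."""
    return (1200.0 / horizon) * (math.log(ip[t + horizon]) - math.log(ip[t]))


def lag_term(ip, t, lag):
    """z_{t-l} = 1200 * [log(IP_{t-l}) - log(IP_{t-l-1})], with l = lag."""
    return 1200.0 * (math.log(ip[t - lag]) - math.log(ip[t - lag - 1]))


# Synthetic index growing by 1% per month (illustration only, not the real data).
ip = [100.0 * 1.01 ** k for k in range(30)]
annual = growth_rate(ip, t=0, horizon=12)
monthly = lag_term(ip, t=5, lag=0)
```

For a series with a constant monthly growth factor both quantities equal $1200\log(1.01)$, which shows that the factor $1200/t'$ puts the $t'$-month change on the same annualized percentage scale as the one-month lag terms.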
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We adopt an expanding window scheme to carry out real time estimation for 
model (6.1). That is, we estimate parameters $\eta_l$'s and $\beta_j$'s with information from
time 1 to time $t$. Note that in such a setting, at time $t$, the dependent variable $y_{t''+t'}$ is
only available for $t'' = 1, \ldots, t - t'$. Let $\hat{\eta}_l^{(t)}$'s and $\hat{\beta}_j^{(t)}$'s denote the corresponding
estimates. With model (6.1) and the estimates, at time $t$, we predict $y_{t+t'}$ by

$$\hat{y}_{t+t'} = \hat{\eta}_0^{(t)} + \sum_{l=0}^{3} z_{t-l}\,\hat{\eta}_{l+1}^{(t)} + \sum_{k=1}^{8}\sum_{j \in G_k} \hat{\beta}_j^{(t)} x_{tj}. \tag{6.2}$$

In practice, we let $t' = 12$, which corresponds to a one-year change. The prediction
starts at $t = 132$ (June 1975) and ends at $t = 546$ (December 2009). Under
this setting, there are 415 time blocks. For each time block, we applied two methods
to estimate parameters in model (6.1). The first method used the gvsnss to select
among the 125 macroeconomic variables and then re-estimated regression coefficients of the
selected variables with the ordinary least squares method. For the gvsnss estimation,
we used five-fold cross validation to select the tuning parameter. The second method
is similar to the first one but uses the lasso for variable selection. For the lasso
estimation, we also used five-fold cross validation to select the tuning parameter.
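
The expanding window scheme can be sketched as a loop over forecast origins. In the Python sketch below, `fit` and `predict` are placeholders for any estimation method (they are not the gvsnss or lasso routines of the paper), time is zero-based, and the toy "model" is simply the training mean of the response. The point is the bookkeeping: at origin $t$ only pairs whose response has already been realized enter the training set.

```python
def expanding_window_forecasts(x, y, horizon, first_t, last_t, fit, predict):
    """Forecast y[t + horizon] at each origin t using data available at time t."""
    forecasts = {}
    for t in range(first_t, last_t + 1):
        # The pair (x[s], y[s + horizon]) is observed at time t only if s + horizon <= t.
        train = range(0, t - horizon + 1)
        model = fit([x[s] for s in train], [y[s + horizon] for s in train])
        forecasts[t + horizon] = predict(model, x[t])
    return forecasts


# Toy usage with a mean-only "model" (illustration only).
x = list(range(20))
y = [float(v) for v in range(20)]
out = expanding_window_forecasts(
    x, y, horizon=2, first_t=5, last_t=7,
    fit=lambda xs, ys: sum(ys) / len(ys),
    predict=lambda model, x_t: model,
)
n_blocks = len(range(132, 546 + 1))   # origins t = 132, ..., 546
```

Counting the origins from $t = 132$ to $t = 546$ inclusive reproduces the 415 time blocks mentioned in the text.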

We also used principal components (PCs) of the selected variables
to construct models for prediction. For simplicity, we used the first four PCs for the
prediction. If the number of selected variables was less than four, we used the selected
variables as the predictors.

The plot in the top right panel of Figure 5 shows the number of selected variables
for the 415 time blocks, while the plots in the bottom panel of Figure 5 show frequencies
of selected variables for each macroeconomic group under the gvsnss and the lasso,
respectively. The results show that the gvsnss estimation selected fewer variables and
produced stronger between-group sparsity and within-group sparsity than the lasso.

We also report the out-of-sample mean squared error under the
two estimation methods. The out-of-sample mean squared error is defined as

^ T-t' 

MSE'os = j—T, Y.^yt+t' - yt+t')'. (6.3) 
t=i 
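
As a rough illustration of (6.3), the out-of-sample mean squared error averages squared forecast errors over the prediction dates. In the Python sketch below, the function name and the choice to divide by the number of available forecasts are assumptions; the exact normalizing constant used in the paper may differ.

```python
def out_of_sample_mse(actual, forecast):
    """Average of (actual - forecast)^2 over the dates that have a forecast."""
    errors = [(actual[t] - forecast[t]) ** 2 for t in forecast]
    return sum(errors) / len(errors)


# Toy values keyed by date (illustration only).
actual = {1: 1.0, 2: 2.0, 3: 4.0}
forecast = {2: 1.0, 3: 2.0}
mse = out_of_sample_mse(actual, forecast)
```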

The results are shown in Table 1 and Figure 6, where Model 1 is the model without
the lag terms, Model 2 is the model with the lag terms, PC is the model using the
first four PCs of all macroeconomic variables, and AR is the model with the lag
terms but without the grouped variable terms. The results suggest that including
the macroeconomic variables can slightly improve the prediction results.

6.2 Retirement plan data 

The data set, adopted from [sl, [i^, contains information about employee retirement 
plans of 92 firms. The retirement plans are managed by a company called Best 
Retirement Inc. (BRI). The response variable is the contribution to the retirement plan
at the end of the first year, measured on the logarithmic scale. Let $y_i$ denote the
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response variable corresponding to the ith retirement plan. Our aim here is to help 
the company assess whether the presence of a specially trained salesperson, named Susan
Shepard, has a positive effect on $y_i$. For the $i$th retirement plan, we define $x_{i9} = 1$
if Susan Shepard is present and $x_{i9} = 0$ otherwise. The data set also contains eight
other variables. To fully assess the effect of the presence of Susan Shepard on $y_i$, we consider
interactions between $x_{i9}$ and the eight variables in the regression model. We call the
collection of $x_{i9}$ and the interaction terms the "Susan Shepard Effect" group. Let
$G_{SSE}$ denote the set that contains indices of covariates in the Susan Shepard Effect
group. We jointly estimate regression coefficients of the covariates with indices
in $G_{SSE}$. After some calculations, we excluded one interaction variable that has the
same value for all retirement plans. The set $G_{SSE}$ therefore contains indices of only
eight variables.

We model the expectation of the response variable $\mu_i = E(y_i \mid \beta, x_i)$ by



We applied three methods, the gvsnss with five-fold cross validation, the gvsnss with
the Bayes factor, and the lasso with ten-fold cross validation, to estimate parameters
in model (6.4). To carry out the parameter estimation, each column of the design
matrix $X$ was standardized to have mean zero and variance one. The results are
shown in Figure 7. The estimation results under the lasso suggest that covariates
in the Susan Shepard Effect group do have positive effects on the response variable,
while the results under the two gvsnss estimations imply that covariates in the Susan
Shepard Effect group do not have such effects.

We also carried out 100 sub-sampling estimations for the model. At each
sub-sampling instance, we randomly split the data, assigning two thirds to the training set and
one third to the test set. We used data from the training set to estimate
parameters in model (6.4) and data from the test set to compute the predictive mean
squared error. We also computed the number of covariates with non-zero estimated
coefficients and the number of covariates with positive estimated coefficients in the
Susan Shepard Effect group. The results are shown in Table 2.
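
The sub-sampling and standardization steps can be sketched as follows. This Python sketch is illustrative only: the function names, the rounding of the two-thirds cut and the use of the $1/n$ variance are assumptions, since the text does not specify them. A column with zero variance cannot be standardized, which is consistent with excluding the interaction variable that takes the same value for all plans.

```python
import random


def split_two_thirds(n, rng):
    """Randomly split indices 0..n-1 into a training set (about 2/3) and a test set."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = (2 * n) // 3
    return sorted(idx[:cut]), sorted(idx[cut:])


def standardize(column):
    """Rescale a column to mean zero and variance one (1/n variance)."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    if var == 0.0:
        raise ValueError("constant column cannot be standardized")
    sd = var ** 0.5
    return [(v - mean) / sd for v in column]


rng = random.Random(0)
train, test = split_two_thirds(92, rng)   # 92 firms, as in the data set
z = standardize([1.0, 2.0, 3.0, 4.0])
```

Repeating the split 100 times with fresh random draws and refitting on each training set would give the sub-sampling summaries described above.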

7 Discussion 

We have proposed a specified prior, called the nested spike and slab prior, to model 
collective behavior of regression coefficients in grouped variable selection. We have 
developed numerical procedures for solving the optimization problem related to max- 
imum a posteriori estimation for the model. Simulation studies showed that the 
proposed estimator performs relatively well in variable selection when within-group
sparsity is present. However, we have found that the proposed estimator will lose its
advantage in parameter estimation if groups that contain the true covariates also
contain too many redundant covariates. Subsequent asymptotic analysis also
confirmed our findings.
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(6.4) 



ieGssE 
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With suitable modifications, the nested spike and slab prior can be extended 
to tackle grouped variable selection problems in the generalized linear models, time 
series models such as autoregressive and moving average models, or graphical models 
in covariance matrix estimation. 

Acknowledgments 

Tso-Jung Yen is supported by grants NSC 97-3112-B-001-020 and NSC 98-3112-B-001-027
in the National Research Program for Genomic Medicine and Academia
Sinica grant AS-100-TP2-C01. Yu-Min Yen would like to thank Professor Oliver
Linton for his encouragement and helpful suggestions.






A Proof of Theorems 5.1 and 5.2



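Proof of Theorem 5.1. Define $w = \hat{\beta}^{\tau} - \beta$. It can be shown that $w$ is the
minimizer of the objective function $V_{\tau}(w^*, \beta, G)$ defined in (5.1) with respect to $w^*$.
Therefore $V_{\tau}(w, \beta, G) \le V_{\tau}(0, \beta, G)$. Here $V_{\tau}(0, \beta, G)$ can be explicitly expressed as

$$V_{\tau}(0, \beta, G) = \|e\|_2^2 + \lambda\|\beta\|_2^2 + \rho_1\sum_{j=1}^{p}\frac{\log(1 + \tau^{-1}|\beta_j|)}{\log(1 + \tau^{-1})} + \rho_2\sum_{k=1}^{m}\sqrt{q_k}\,\frac{\log(1 + \tau^{-1}\|\beta_{G_k}\|_2)}{\log(1 + \tau^{-1})},$$

where $e = y - X\beta$. Further note that

$$\|e - Xw\|_2^2 + \lambda\|w + \beta\|_2^2 = \|e\|_2^2 + w^T(X^T X + \lambda I)w - 2w^T(X^T e - \lambda\beta) + \lambda\|\beta\|_2^2.$$

With the results given above, we can compute $V_{\tau}(w, \beta, G) - V_{\tau}(0, \beta, G)$. In addition,
since $V_{\tau}(w, \beta, G) - V_{\tau}(0, \beta, G) \le 0$, by rearranging the terms in
$V_{\tau}(w, \beta, G) - V_{\tau}(0, \beta, G)$, we obtain

\begin{align}
& w^T(X^T X + \lambda I)w \tag{A.1} \\
& \quad \le 2w^T(X^T e - \lambda\beta) \tag{A.2} \\
& \qquad + \rho_1\sum_{j=1}^{p}\left[\frac{\log(1 + \tau^{-1}|\beta_j|)}{\log(1 + \tau^{-1})} - \frac{\log(1 + \tau^{-1}|w_j + \beta_j|)}{\log(1 + \tau^{-1})}\right] \tag{A.3} \\
& \qquad + \rho_2\sum_{k=1}^{m}\sqrt{q_k}\left[\frac{\log(1 + \tau^{-1}\|\beta_{G_k}\|_2)}{\log(1 + \tau^{-1})} - \frac{\log(1 + \tau^{-1}\|w_{G_k} + \beta_{G_k}\|_2)}{\log(1 + \tau^{-1})}\right]. \tag{A.4}
\end{align}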

Note that by Assumption 1, (A.1) can be bounded from below in a way such that

$$w^T(X^T X + \lambda I)w \ge n(\kappa_n + \lambda n^{-1})\|w\|_2^2. \tag{A.5}$$

In the following discussion we derive inequalities to bound (A.2), (A.3) and (A.4).

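Deriving an upper bound for (A.3). We first derive an inequality to bound the
difference $\sum_{j=1}^{p}[\log(1 + \tau^{-1}|\beta_j|) - \log(1 + \tau^{-1}|w_j + \beta_j|)]$. For $j \in S^{\tau} = \{j : \hat{\beta}^{\tau}_j \ne 0\}$,
$|w_j + \beta_j| = |\hat{\beta}^{\tau}_j| > 0$. Then given that $\tau \in [0, 1)$, for $j \in S^{\tau}$, we have

\begin{align}
\log\left(\frac{1 + \tau^{-1}|\beta_j|}{1 + \tau^{-1}|w_j + \beta_j|}\right)
&= \log\left(1 + \frac{|\beta_j| - |w_j + \beta_j|}{\tau + |w_j + \beta_j|}\right) \notag \\
&\le \frac{|\beta_j| - |w_j + \beta_j|}{\tau + |w_j + \beta_j|} \notag \\
&\le \frac{|\beta_j| - |w_j + \beta_j| + |w_j|}{\tau + |w_j + \beta_j|}. \tag{A.6}
\end{align}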
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Now for j G n S^, we have f3j = 0, therefore for j G S*"^ fl S'^, the right hand side 



of ([Qj) is zero. For j E (1 S, note that - \wj + l3j\ < \l3j 



\w 



Then with the result given above, we have 



En 



1 + r-i|w,- + f3j 



\l3j - Wj 




+ 




T + 


Wj + I3j\ 





j&s^ns 



(A.7) 



where C2 = min^.g^^ 

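Now consider the summation over indices $j \in (S^{\tau})^c$. Note that for $j \in (S^{\tau})^c \cap S^c$,
we have $\hat{\beta}^{\tau}_j = \beta_j = 0$, therefore the difference $\log(1 + \tau^{-1}|\beta_j|) - \log(1 + \tau^{-1}|w_j + \beta_j|) = 0$.
On the other hand, for $j \in (S^{\tau})^c \cap S$, we have $|w_j + \beta_j| = |\hat{\beta}^{\tau}_j - \beta_j + \beta_j| = 0$ and
$|\beta_j| = |\hat{\beta}^{\tau}_j - \beta_j| = |w_j|$. Therefore for $j \in (S^{\tau})^c \cap S$, we have

$$\log(1 + \tau^{-1}|\beta_j|) - \log(1 + \tau^{-1}|w_j + \beta_j|) = \log(\tau + |w_j|) + \log(\tau^{-1}).$$

In addition, for $\tau \in [0, 1)$, $\log(\tau + |w_j|) \le \log(1 + |w_j|) \le |w_j|$. Now with
$c_3 = \min_{j \in S}|\beta_j|$, we have $c_3 \le \min_{j \in (S^{\tau})^c \cap S}|\beta_j| = \min_{j \in (S^{\tau})^c \cap S}|w_j| \le |w_j|$ for any
$j \in (S^{\tau})^c \cap S$. Therefore with the results given above, we have

\begin{align}
\sum_{j \in (S^{\tau})^c}\left[\log(1 + \tau^{-1}|\beta_j|) - \log(1 + \tau^{-1}|w_j + \beta_j|)\right]
&\le \sum_{j \in (S^{\tau})^c \cap S}|w_j|\left[1 + c_3^{-1}\log(\tau^{-1})\right] \notag \\
&\le \left[1 + c_3^{-1}\log(\tau^{-1})\right]\sum_{j \in S}|w_j| \notag \\
&\le \left[1 + c_3^{-1}\log(\tau^{-1})\right]\sqrt{s}\,\|w\|_2. \tag{A.8}
\end{align}

For $\tau \in [0, 1)$, we have $[\log(1 + \tau^{-1})]^{-1} \le [\log(\tau^{-1})]^{-1}$. Now combining results in
(A.7) and (A.8), we can bound (A.3) in a way such that

\begin{align}
& \rho_1\sum_{j=1}^{p}\left[\frac{\log(1 + \tau^{-1}|\beta_j|)}{\log(1 + \tau^{-1})} - \frac{\log(1 + \tau^{-1}|w_j + \beta_j|)}{\log(1 + \tau^{-1})}\right] \notag \\
& \quad \le \frac{\rho_1}{\log(\tau^{-1})}\left[\sum_{j \in S^{\tau}}\log\left(\frac{1 + \tau^{-1}|\beta_j|}{1 + \tau^{-1}|w_j + \beta_j|}\right) + \sum_{j \in (S^{\tau})^c}\log\left(\frac{1 + \tau^{-1}|\beta_j|}{1 + \tau^{-1}|w_j + \beta_j|}\right)\right] \notag \\
& \quad \le \frac{\rho_1}{\log(\tau^{-1})}\left\{2c_2^{-1}\sqrt{s}\,\|w\|_2 + \left[1 + c_3^{-1}\log(\tau^{-1})\right]\sqrt{s}\,\|w\|_2\right\} \notag \\
& \quad = \rho_1\left[\frac{2c_2^{-1} + 1}{\log(\tau^{-1})} + c_3^{-1}\right]s^{1/2}\|w\|_2. \tag{A.9}
\end{align}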
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Deriving an upper bound for (]A.4[) . Similarly, for k G = {k : \\^gJU > 0}; 
we have \\wg^. + Pajh = ll/^Cfelb > 0. In turn, we have 

Ipg/^ 1 + ^'^I|/3gJ|2 \ ^ ||/3Gj|2-|kG,+/3Gj|2+|kGj|2 _^^^^^Q^ 



for k G i?'^. 

Now if k E n R^, where = {k : H/^Gklb = 0}, then the right hand side of 
(lA.lOp is zero. On the other hand, for j G G^r H S, we have C2 = min^g^^ < 

i^i^fce/J- I|3gJ|2 < IIA^Jb- In addition, ||/3gJ|2 - Ikc^ +/5gJ|2 < WPg^ - wg^ - 
/^Gfclb = llw^Gfelb- Then with the results given above, we can further obtain 



^ 



-2||U'G 



C2 

/ \ 1/2 / X 1/2 

< 2c-($:y^M ($:ikGj|^ 

< 2c2igi/'||w;||2, (A.ll) 

where qr = \Gr\ = J2k(^R Ik is the number of indices covered by Gr. We now consider 

the summation over indices k G {RY. If A; G {Ry, WPhJl'^ ~ 0- Therefore, we have 

lkG, + /3Gj|2 = PS,-/3G,+/3Gj|2 = 0and | |/3gJ b = - /3gJ I2 = I ^gJ In 
turn, 

log(l + t-^WPg, 1 12) - log(l + r-'\\wG, + Pg, 1 12) = log(r + \\wg, | I2) + log(r-i) 

forfcG (^n^^- In addition, for TG [0,1), Mr+PcJb) <log(l + ||/3Gj|2) < W^gJU- 
Further note that 

C3 = min|/3j|< ^ min ||/3c.J|2= min IkcJb- 
JG5 fce(K-)M|/3Gj|27^o fce(iJ-)M|«;Gjl27^o 

Moreover, for an arbitrary index k G {Ry, \ \wgJ\2 = H/^GsIb, therefore ||wgj.||2 7^ 
implies H/^G^Ib 7^ and the index k E R. Now by applying the results given above. 
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we have 



J2 v%[ log(l + \\(3g, 1 12) - log(l + I \WG, + 1 12 

k&{R-r,\\wGj\2j^0 

+ Yl V^[log(^+lkGj|2) + log(r-i)] 
fce{fl-)MkGj|2=o 

< [l + C3-Mog(r-i)] 5^ v^lkcjb 

fce{-R^)^fcG-R 

< [l + c^HogiT-')]q]l'\\w\\,. 

Combining the results in flA.lip and flA.12p . we can bound flA.4p in a way such that 

"log(l + r-i I 1 12) log(l + r-i I \wg, + (3g, 1 12) 



(A.12) 



< 



k=l 
P2 



log(r-i) 



log(l + r-i) log(l + T-^] 



keR-^ 



< 



P2 



r-^\\wG,+M\2 
l + r~'\\^Gj\2 



r 

[2c^\f\\w\\,+ [l + c,Hog{r-')]qf\ 



log(r 



I ""^112 



P2 



2Cn' + 1 



V2|| II 



_log(r-i 

Deriving an upper bound for (IA.2p . First note that 

w^X'^e < ||ti;||i||X^e||oo. 
Now for ||w||i, we can decompose it as 

ll^lli = ll^s-^nslli + Il'"^5^n5=lli + Il'"^(5^)=n5lli + ll^{5^)=nS'-lli- 



(A.13) 



(A. 14) 



Note that for the first and third terms on the right hand side of (lA.14p . we have 

ll'^^s^nslli — ll^slli I l^(s-^)<=nsl li — ll'^slk- -^^^ second term on the right 
hand side of f lA.14p . we have I l^t'sTpi^cl |i < The fourth term on the right 

hand side of (1A.14P is zero since (S''^)'^nS"^ is an intersection of indices for entries with 
zero values in /3 and entries with zero values in P'^. With the results given above, we 
can further bound in a way such that 



kill < 2||w5||i + ||wg. 











\w\\ 









(A.15) 
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With the result in (]A.15|) . we can bound (]A.2|) in a way such that 



2w'{XU-Xl3) < 2||w||i||X'e||oo + 2A|m;'/3| 

1/2 

< 25^/2 ■ 



2+1 - 

s 

,1/2. 



wMX'el 



+2A| |w| I2S ' max 
Combining the results (lA.Qp . (]A.13|) and (]A.16|) . we obtain 

l/2n 



(A.16) 



n{Kn + Xn-^)\\w\\l < 2s^/2 



+Pi 

+P2 



s 



2C2 ^ + 1 _i' 

log(r-i) + 
2^ + C3- 



II2 
.1/21 



iX^elL + 2As^/^ max|/3,-|||w||2 



log(r-i) 



1/2|| II 



(A.17) 



Then by using the fact that $s = |S| \le |G_R| = q_R$ and doing some rearrangement in
(A.17), we obtain the inequality (5.4), which completes the proof. □

Proof of Theorem 5.2. We start our proof by showing that with probability at least $1 - \alpha$,
the inequality $2\|X^T e\|_{\infty} \le \psi_n$ will hold, where $\psi_n$ is defined in (5.5).
Note that $\{2\|X^T e\|_{\infty} \le \psi_n\}$ is equivalent to the following event:

$$\mathcal{A} = \bigcap_{k=1}^{m}\left\{2\|X_{G_k}^T e\|_{\infty} \le \psi_n\right\}.$$

We will establish the inequality $P(\mathcal{A}) = 1 - P(\mathcal{A}^c) \ge 1 - \alpha$ by showing that, given
$\psi_n$ defined in (5.5), $P(\mathcal{A}^c) \le \alpha$. The technique we use to derive the inequality
$P(\mathcal{A}^c) \le \alpha$ is borrowed from Lemma B.1 of fs^. Note that the tail probability $P(\mathcal{A}^c)$
can be bounded in a way such that

$$P(\mathcal{A}^c) = P\left(\bigcup_{k=1}^{m}\left\{2\|X_{G_k}^T e\|_{\infty} > \psi_n\right\}\right) \le \sum_{k=1}^{m}P\left(2\|X_{G_k}^T e\|_{\infty} > \psi_n\right) \le \sum_{k=1}^{m}\sum_{j \in G_k}P\left(2\left|\sum_{i=1}^{n}x_{ij}e_i\right| > \psi_n\right). \tag{A.18}$$

Under the assumptions given in Theorem 5.2, the $e_i$'s are i.i.d. normal variables with mean
zero and variance $\sigma^2$, therefore $\sum_{i=1}^{n}x_{ij}e_i$ is a normal variable with mean zero and
variance $n\zeta_j\sigma^2$. In turn, we can express $|\sum_{i=1}^{n}x_{ij}e_i| = \sqrt{n\zeta_j}\,\sigma|Z|$,
where $Z$ is a standard normal variable. By using the Chernoff bound argument on
the tail probability of a standard normal variable, we can bound the right hand side
of (A.18) in a way such that

$$\sum_{k=1}^{m}\sum_{j \in G_k}P\left(2\left|\sum_{i=1}^{n}x_{ij}e_i\right| > \psi_n\right) \le \sum_{k=1}^{m}q_k\,P\left(|Z| > \frac{\psi_n}{2\sqrt{n\max_j\zeta_j}\,\sigma}\right) \le m\exp\left(-\frac{\psi_n^2}{8n\max_j\zeta_j\,\sigma^2} + \log\bar{q}\right), \tag{A.19}$$

where $\bar{q} = m^{-1}\sum_{k=1}^{m}q_k$. With $\psi_n$ defined in (5.5), the right hand side of (A.19) is
equal to $\alpha$, and further with (A.18), we obtain $P(\mathcal{A}^c) \le \alpha$, which implies that with
$\psi_n$ defined in (5.5), $P(\mathcal{A}) = 1 - P(\mathcal{A}^c) \ge 1 - \alpha$.
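
The tail bound can be checked by simulation. The Python sketch below computes $\psi_n$ from (5.5) and estimates how often $2\|X^T e\|_{\infty} > \psi_n$ for Gaussian errors. The design matrix, the grouping ($m = 3$ groups of size two, so $\bar{q} = 2$), the sample sizes and the seed are all assumptions chosen for illustration; columns are rescaled so that $\zeta_j = 1$.

```python
import math
import random


def psi_n(n, sigma, zeta_max, m, q_bar, alpha):
    """psi_n = 2*sigma*sqrt(2*n*max_j zeta_j*[log(m/alpha) + log(q_bar)])."""
    return 2.0 * sigma * math.sqrt(
        2.0 * n * zeta_max * (math.log(m / alpha) + math.log(q_bar)))


def scaled_design(n, p, rng):
    """Random design whose columns satisfy (1/n) * sum_i x_ij^2 = 1."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    for j in range(p):
        scale = math.sqrt(sum(X[i][j] ** 2 for i in range(n)) / n)
        for i in range(n):
            X[i][j] /= scale
    return X


def violation_rate(X, sigma, psi, reps, rng):
    """Fraction of simulated error vectors with 2 * ||X^T e||_inf > psi."""
    n, p = len(X), len(X[0])
    hits = 0
    for _ in range(reps):
        e = [rng.gauss(0.0, sigma) for _ in range(n)]
        sup = max(abs(sum(X[i][j] * e[i] for i in range(n))) for j in range(p))
        hits += 2.0 * sup > psi
    return hits / reps


rng = random.Random(2024)
n, p, m, q_bar, sigma, alpha = 50, 6, 3, 2.0, 1.0, 0.1
X = scaled_design(n, p, rng)
psi = psi_n(n, sigma, 1.0, m, q_bar, alpha)
rate = violation_rate(X, sigma, psi, reps=2000, rng=rng)
print(round(psi, 2), rate)
```

Because the argument relies on a union bound, the simulated violation rate is expected to fall below $\alpha$, often by a wide margin.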



To complete the proof, note that since we have assumed $\tau = n^{-1}$, effectively we have
$\hat{\beta}^{\tau} \to \hat{\beta}_{gvsnss}$ and $S^{\tau} \to \hat{S}$ as $n \to \infty$. Therefore with the result
from Theorem 5.1 and the assumptions on $\lambda$, $\rho_1$, $\rho_2$ and $\tau$, as $n \to \infty$, the inequality

$$\|\hat{\beta}_{gvsnss} - \beta\|_2 \le \frac{q_R^{1/2}\Lambda_n\psi_n}{n(\kappa_n + \mu_n)} \tag{A.20}$$

will hold with probability $1 - \alpha$, where $\Lambda_n$ is defined in (5.7), $\mu_n = n^{-1}\lambda$ is defined
in (5.8) and $\psi_n$ is defined in (5.5), which completes the proof. □



B Proof of Theorem 5.3

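Proof of Theorem 5.3. Define

$$U_{\tau}(\beta', G^*, G^{**}) = \rho_2\left[\sum_{k=1}^{m^*}\sqrt{q^*_k}\,\frac{\log(1 + \tau^{-1}\|\beta'_{G^*_k}\|_2)}{\log(1 + \tau^{-1})} - \sum_{l=1}^{m^{**}}\sqrt{q^{**}_l}\,\frac{\log(1 + \tau^{-1}\|\beta'_{G^{**}_l}\|_2)}{\log(1 + \tau^{-1})}\right], \tag{B.1}$$

where $\beta'_{G^*_k}$ is the coefficient vector in which the elements are those indexed by $G^*_k$
in the vector $\beta'$. The vector $\beta'_{G^{**}_l}$ follows a similar definition. The function (B.1)
is the difference between the log-sum penalties involving $\ell_2$-norms indexed by $G^*$
and $G^{**}$. Note that, with (B.1), the objective function $V_{\tau}(0, \beta', G^*)$ in (5.1) can be
re-expressed as

$$V_{\tau}(0, \beta', G^*) = V_{\tau}(0, \beta', G^{**}) + U_{\tau}(\beta', G^*, G^{**}). \tag{B.2}$$

Since $\hat{\beta}^{\tau *}$ is the minimizer of $V_{\tau}(0, \beta', G^*)$, it must be the solution to the
following subgradient equations:

$$2X^T(y - X\beta') - 2\lambda\beta' - \rho_1 g' - \rho_2\sqrt{q^{**}}\,u^{**} - \rho_2\left(\sqrt{q^*}\,u^* - \sqrt{q^{**}}\,u^{**}\right) = 0, \tag{B.3}$$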
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where 



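$$(g')_j = \frac{h'_j}{[\tau\log(1 + \tau^{-1})](1 + \tau^{-1}|\beta'_j|)},$$

with $h'_j = \mathrm{sign}(\beta'_j)$ if $\beta'_j \ne 0$ and $-1 \le h'_j \le 1$ if $\beta'_j = 0$, and

$$(u^*)_j = \frac{v^*_j}{[\tau\log(1 + \tau^{-1})](1 + \tau^{-1}\|\beta'_{G^*_{k_j}}\|_2)},$$

with $v^*_j = \beta'_j/\|\beta'_{G^*_{k_j}}\|_2$ if $\|\beta'_{G^*_{k_j}}\|_2 > 0$ and $\sum_{j \in G^*_{k_j}}(v^*_j)^2 \le 1$ if $\|\beta'_{G^*_{k_j}}\|_2 = 0$, where
$k_j$ is the index for the group that $j$ belongs to, i.e. if $j \in G^*_{k'}$, then $k_j = k'$. The
quantity $(u^{**})_j$ follows a similar definition. In addition, $(q^*)_j = q^*_{k_j}$, and $(q^{**})_j$ is defined analogously.
Note that the derivation of the subgradient equations (B.3) has explicitly used
representation (B.2), and after some simple rearrangement, (B.3) becomes

$$2X^T(y - X\beta') - 2\lambda\beta' - \rho_1 g' - \rho_2\sqrt{q^{**}}\,u^{**} = \rho_2\left(\sqrt{q^*}\,u^* - \sqrt{q^{**}}\,u^{**}\right), \tag{B.4}$$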

where 



'q^u* — y/q**u* 



log(l 



h V 



ir+WP'Gr \\2){t + WP'arAU) 



■3 j 



(B.5) 



For each j, one of the following four cases will occur: (i) 11/3^* 1 12 = and 1 1 12 = 0; 

(ii) 11/3' II2 > and 11/3' 1 12 = 0; (iii) ||/3' lb = and \\P'\\2 > 0; and (iv) 

11/3^, II2 > and ||/3q..||2 > 0. In the following discussion, we will evaluate (IB.Sp 

under the four cases. 

We consider case (i) first. If (i) occurs, then all regression coefficients with indices 
in Gl. or G^* will be zero. It implies that /3j = and by definitions, v* is an arbitrary 



quantity such that < {v*Y < J2jeG* ("^j)^ — ^- "^^^ same property applies to v 
For practical purposes, we choose v* = t and v** = r. Then under case (i 



3 ■ 



'q u — yq**u 



T log(l + 
log(l + r-i) 



r 



(B.6) 



Now consider case (ii). If (n) holds, then by definition, v* = /3j/||/3G* ||2- In 
addition, since ||/3g**||2 = 0, therefore v** is an arbitrary quantity such that < 




< E 



< 1. For practical purposes, we choose v** = r. Moreover, 



||/3^**||2 = implies that all coefficients with indices in G*i* are zero. Therefore 
/3' =0 and V* = (3'J\\P'a* lb = 0. Then under case (ii). 



— ^/q**U**)j 



1 



r+lWc, II2 



II2 



log 1 + r 



-1^ 



(B.7) 



Now consider case (iii). Under case (iii), since | \(3q** | I2 > 0, therefore v** = (3'J\\(3'q,, | I2. 
In addition, II2 = implies that all coefficients with indices in G\. are zero. 

Therefore (3'^ = and v** = /3j/||/3G**ll2 = 0. In addition, v* is an arbitrary quantity 
such that < (v*)^ < J^jeG* (^j)^ — ^- Here we let v* = t. Therefore under case 



m 



'q*u 



'q**u 



1 



rlog(l + r-i) 



r+\Wc 



log(l + r-i) 

Finally we consider case (iv). Under case (iv), v* = fi'j/\\P'G* II2 and v* 
1/3^.. 1 12. Further by direct calculation, we have 



(B.8) 



'q*u — y/q**u 



1 



log(l + T- 
1 

log(l + T- 



War 



II2J 



(r+||/3^..||2 



"3 



-J ■ i 



(B.9) 



Now with $\tau = n^{-1}$ and the results from (B.6), (B.7), (B.8), and (B.9), we can
see that $|U_{\tau}(\beta', G^*, G^{**})| = O(\rho_2\max_k\sqrt{q_k}\,[\log(n)]^{-1})$. Therefore if $\rho_2\max_k\sqrt{q_k} = o(\log(n))$,
$U_{\tau}(\beta', G^*, G^{**})$ will approach zero when $n \to \infty$. It further implies that
the right hand side of (B.4) will become zero when $n \to \infty$. On the other hand,
the left hand side of (B.4) is just the subgradient vector of the objective function
$V_{\tau}(0, \beta', G^{**})$. Therefore when $\tau \to 0$, (B.4) becomes the subgradient equations of
$\lim_{\tau \to 0}V_{\tau}(0, \beta', G^{**})$. Since $\hat{\beta}^{*}_{gvsnss}$ is the solution of the subgradient equations (B.3)
when $\tau \to 0$ and (B.4) is just a rearrangement of (B.3), $\hat{\beta}^{*}_{gvsnss}$ is also
the solution to (B.4) when $\tau \to 0$. Since (B.4) becomes the subgradient equations
of $\lim_{\tau \to 0}V_{\tau}(0, \beta', G^{**})$ when $\tau \to 0$, and the solution of (B.4) at $\tau \to 0$ is the
minimizer of $\lim_{\tau \to 0}V_{\tau}(0, \beta', G^{**})$, we conclude that $\hat{\beta}^{*}_{gvsnss}$ is the minimizer of
$\lim_{\tau \to 0}V_{\tau}(0, \beta', G^{**})$, which completes the proof. □



C Proof of Theorem 5.4 



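Proof of Theorem 5.4. Define $w = \hat{\beta}^{\tau} - \beta$. It can be shown that given $\beta$ and $G$
are fixed, $w$ is the minimizer of $V_{\tau}(w^*, \beta, G)$, therefore $w$ is also the solution to the
following subgradient equations:

$$2X^T X w - 2X^T e + 2\lambda(\beta + w) + \rho_1 g + \rho_2\sqrt{q}\,u = 0, \tag{C.1}$$

where

$$(g)_j = \frac{h_j}{[\tau\log(1 + \tau^{-1})](1 + \tau^{-1}|w_j + \beta_j|)},$$

with $h_j = \mathrm{sign}(w_j + \beta_j)$ if $w_j + \beta_j \ne 0$ and $-1 \le h_j \le 1$ if $w_j + \beta_j = 0$, and

$$(u)_j = \frac{v_j}{[\tau\log(1 + \tau^{-1})](1 + \tau^{-1}\|w_{G_{k_j}} + \beta_{G_{k_j}}\|_2)},$$

with $v_j = (w_j + \beta_j)/\|w_{G_{k_j}} + \beta_{G_{k_j}}\|_2$ if $\|w_{G_{k_j}} + \beta_{G_{k_j}}\|_2 > 0$ and $\sum_{j \in G_{k_j}}(v_j)^2 \le 1$ if
$\|w_{G_{k_j}} + \beta_{G_{k_j}}\|_2 = 0$, where $k_j$ is the index for the group that $j$ belongs to.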

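Let $S_1^c = S^c \cap G_R$ and $S_2^c = S^c \cap G_{R^c}$. Here $S^c$ is the set of indices for redundant
covariates, i.e. the covariates with zero coefficients. In addition, $S_1^c$ is the set of
indices for the redundant covariates covered by $G_R$, and $S_2^c$ is the set of indices
for the redundant covariates covered by $G_{R^c}$. By definition, $G_{R^c} \subseteq S^c$, therefore
we have $S_2^c = G_{R^c}$. In addition, $S$, $S_1^c$ and $G_{R^c}$ are three disjoint index sets and
$S \cup S_1^c \cup G_{R^c} = \{1, 2, \ldots, p\}$. With the results given above, we can re-express (C.1)
as

$$2\begin{pmatrix} X_S^T X_S & X_S^T X_{S_1^c} & X_S^T X_{G_{R^c}} \\ X_{S_1^c}^T X_S & X_{S_1^c}^T X_{S_1^c} & X_{S_1^c}^T X_{G_{R^c}} \\ X_{G_{R^c}}^T X_S & X_{G_{R^c}}^T X_{S_1^c} & X_{G_{R^c}}^T X_{G_{R^c}} \end{pmatrix}\begin{pmatrix} w_S \\ w_{S_1^c} \\ w_{G_{R^c}} \end{pmatrix} - 2\begin{pmatrix} X_S^T e \\ X_{S_1^c}^T e \\ X_{G_{R^c}}^T e \end{pmatrix} + 2\lambda\begin{pmatrix} w_S + \beta_S \\ w_{S_1^c} + \beta_{S_1^c} \\ w_{G_{R^c}} + \beta_{G_{R^c}} \end{pmatrix} + \rho_1\begin{pmatrix} g_S \\ g_{S_1^c} \\ g_{G_{R^c}} \end{pmatrix} + \rho_2\begin{pmatrix} \sqrt{q_S}\,u_S \\ \sqrt{q_{S_1^c}}\,u_{S_1^c} \\ \sqrt{q_{G_{R^c}}}\,u_{G_{R^c}} \end{pmatrix} = 0. \tag{C.2}$$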

For practical purposes, we define as the position of index j in the set S. It is 
equivalent to say that index j is the 'd^th. element in S. If j ^ 5, then we just leave 

'd^ undefined. Similar definitions are applied to ■(9^^ and ^^^^ . 
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To make the sign consistency hold, we must have wj = PJ — (3j = for all 

$j \in S_1^c \cup G_{R^c}$, and $\mathrm{sign}(\hat{\beta}^{\tau}_j) = \mathrm{sign}(\beta_j)$ for all $j \in S$. Given that $w$ is the solution to
(C.2), then with the arguments given above, we obtain the following conditions:



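$$\Big(X_S^TX_Sw_S - X_S^T\epsilon + \lambda(w_S + \beta_S) + \frac{\rho_2}{2}\sqrt{q_S}\,u_S\Big)_{\vartheta_j^S} = -\frac{\rho_1}{2}(g_S)_{\vartheta_j^S} \tag{C.3}$$

for $j \in S$, and

$$-\frac{\rho_1}{2\tau\log(1+\tau^{-1})} < \Big(X_{S_1^c}^TX_Sw_S - X_{S_1^c}^T\epsilon + \frac{\rho_2}{2}\sqrt{q_{S_1^c}}\,u_{S_1^c}\Big)_{\vartheta_j^{S_1^c}} < \frac{\rho_1}{2\tau\log(1+\tau^{-1})} \tag{C.4}$$

for $j \in S_1^c$, and

$$\|2X_{G_k}^TX_Sw_S - 2X_{G_k}^T\epsilon + \rho_1 g_{G_k}\|_2 < \frac{\rho_2\sqrt{q_k}}{\tau\log(1+\tau^{-1})} \tag{C.5}$$

for $k \in R^c$.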

The subgradient equations (C.3) are a result of the KKT conditions, and the
inequalities (C.4) and (C.5) are used to ensure that estimated coefficients with indices
in $S_1^c$ and $G_{R^c}$ are zero.

Now by solving equations in (C.3) for $w_S$, we have



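$$w_S = (X_S^TX_S + \lambda I)^{-1}X_S^T\epsilon - (X_S^TX_S + \lambda I)^{-1}\Big(\frac{\rho_1}{2}g_S + \frac{\rho_2}{2}\sqrt{q_S}\,u_S + \lambda\beta_S\Big). \tag{C.6}$$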



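Note that the $\vartheta_j^S$th element in the last term on the right hand side of (C.6) can
be expressed as

$$\Big(\frac{\rho_1}{2}g_S + \frac{\rho_2}{2}\sqrt{q_S}\,u_S + \lambda\beta_S\Big)_{\vartheta_j^S} = \frac{1}{2[\tau\log(1+\tau^{-1})]}\left[\frac{\tau\rho_1 h_j}{\tau + |w_j + \beta_j|} + \frac{\tau\rho_2\sqrt{q_{k_j}}\,v_j}{\tau + \|w_{G_{k_j}} + \beta_{G_{k_j}}\|_2} + 2\lambda\tau\log(1+\tau^{-1})\beta_j\right]. \tag{C.7}$$

Here we define $B_{S,\tau}$ by

$$(B_{S,\tau})_{\vartheta_j^S} = \frac{\tau\rho_1 h_j}{\tau + |w_j + \beta_j|} + \frac{\tau\rho_2\sqrt{q_{k_j}}\,v_j}{\tau + \|w_{G_{k_j}} + \beta_{G_{k_j}}\|_2} + 2\lambda\tau\log(1+\tau^{-1})\beta_j. \tag{C.8}$$

By Assumption 2, $C_{SS} = n^{-1}(X_S^TX_S + \lambda I)$. Practically we can express $w_S$ as

$$w_S = n^{-1}C_{SS}^{-1}X_S^T\epsilon - \frac{1}{2n\tau\log(1+\tau^{-1})}C_{SS}^{-1}B_{S,\tau}. \tag{C.9}$$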
Sign consistency for estimated coefficients with indices in $S$. Now in
order to ensure the sign consistency for estimated coefficients with indices in $S$, we
impose a constraint on each entry of $w_S$. We focus on the following inequality:

$$|w_j| < |\beta_j|. \tag{C.10}$$

Inequality (C.10) implies that for $j \in S$,
$\mathrm{sign}(\hat{\beta}_j) = \mathrm{sign}(\beta_j)$. To see why, let us consider the case
when $\beta_j > 0$. If $\beta_j > 0$, then $|w_j| < |\beta_j|$ means that either
$0 \le \hat{\beta}_j - \beta_j < \beta_j$ or $0 < \beta_j - \hat{\beta}_j < \beta_j$, which
jointly imply that $0 < \hat{\beta}_j < 2\beta_j$. A similar argument can be applied to the
case when $\beta_j < 0$. Therefore given that (C.10) holds, sign consistency holds for
estimated coefficients with indices in $S$.
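
The sign argument can be checked numerically. The Python sketch below is only an illustration, not part of the proof: it assumes the condition reads $|w_j| < |\beta_j|$ with $w_j$ the estimate minus the true coefficient, and the function names and grid values are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_preserved(beta_j, w_j):
    """True if the estimate beta_j + w_j has the same sign as beta_j."""
    return sign(beta_j + w_j) == sign(beta_j)


def check_grid(betas, steps=100):
    """Check sign preservation for every w_j on a grid with |w_j| < |beta_j|."""
    for b in betas:
        for t in range(-(steps - 1), steps):
            w = abs(b) * t / steps  # strictly inside (-|b|, |b|)
            if not sign_preserved(b, w):
                return False
    return True


# Positive and negative true coefficients are both covered.
all_ok = check_grid([-2.0, -0.5, 0.5, 2.0])

# Once |w_j| exceeds |beta_j| the sign can flip, so the condition is not idle.
flipped = not sign_preserved(2.0, -3.0)
```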

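With representation (C.9), for $j \in S$, we can bound $|w_j|$ in a way such that

$$|w_j| \le n^{-1}\big|(C_{SS}^{-1}X_S^T\epsilon)_{\vartheta_j^S}\big| + \frac{1}{2n\tau\log(1+\tau^{-1})}\big|(C_{SS}^{-1}B_{S,\tau})_{\vartheta_j^S}\big|. \tag{C.11}$$

By plugging the right hand side of (C.11) into the left hand side of (C.10) and doing
some rearrangements, we obtain the following inequality:

$$n^{-1}\big|(C_{SS}^{-1}X_S^T\epsilon)_{\vartheta_j^S}\big| < |\beta_j| - \frac{1}{2n\tau\log(1+\tau^{-1})}\big|(C_{SS}^{-1}B_{S,\tau})_{\vartheta_j^S}\big|. \tag{C.12}$$

Further note that for any $j \in S$,

$$\big|(C_{SS}^{-1}X_S^T\epsilon)_{\vartheta_j^S}\big| \le \|C_{SS}^{-1}X_S^T\epsilon\|_\infty \le \kappa_{\min}^{-1}\|X_S^T\epsilon\|_\infty,$$

where $\kappa_{\min}$ is the minimum eigenvalue of $C_{SS}$. Now with the results given above, we
construct the following event:

$$E_1 = \left\{ n^{-1}\kappa_{\min}^{-1}\|X_S^T\epsilon\|_\infty < \min_{j\in S}|\beta_j| - \frac{1}{2n\tau\log(1+\tau^{-1})}\|C_{SS}^{-1}B_{S,\tau}\|_\infty \right\}. \tag{C.13}$$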



Since the left hand side of the inequality stated in $E_1$ is larger than the left hand side
of (C.12), and the right hand side of the inequality stated in $E_1$ is smaller than the
right hand side of (C.12), if the inequality stated in $E_1$ holds, then (C.12)
will hold. In turn, (C.3) and (C.10) will hold, and the sign consistency for estimated
coefficients with indices in $S$ can be established.

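We go on to derive an estimate for the tail probability of $E_1$. Define $\psi_{1,n}$ by

$$\psi_{1,n} = \min_{j\in S}|\beta_j| - \frac{1}{2n\tau\log(1+\tau^{-1})}\|C_{SS}^{-1}B_{S,\tau}\|_\infty. \tag{C.14}$$

Note that $E_1$ is equivalent to the event
$\bigcap_{j\in S}\{n^{-1}|\sum_{i=1}^n x_{ij}\epsilon_i| < \psi_{1,n}\kappa_{\min}\}$. On the
other hand, by the assumptions on the $\epsilon_i$'s and $\sum_{i=1}^n x_{ij}^2 = n$, one can
show that $\sum_{i=1}^n x_{ij}\epsilon_i$ is a normal variable with mean zero and variance
$\sigma^2\sum_{i=1}^n x_{ij}^2 = n\sigma^2$. Therefore, we can bound the probability of
$E_1^c$ in a way such that

$$P(E_1^c) \le \sum_{j\in S} P\left(n^{-1}\Big|\sum_{i=1}^n x_{ij}\epsilon_i\Big| \ge \psi_{1,n}\kappa_{\min}\right) \le sP\left(|Z| > \frac{\sqrt{n}\,\psi_{1,n}\kappa_{\min}}{\sigma}\right), \tag{C.15}$$

where $Z$ is a standard normal variable. By applying a Chernoff bound argument to
the right hand side of (C.15), we further obtain

$$P(E_1^c) \le \exp\left\{-\frac{n\psi_{1,n}^2\kappa_{\min}^2}{2\sigma^2} + \log s\right\} = \exp\left\{-n\left[\frac{\psi_{1,n}^2\kappa_{\min}^2}{2\sigma^2} - \frac{\log s}{n}\right]\right\}. \tag{C.16}$$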
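
The step from (C.15) to (C.16) rests on the Gaussian tail bound $P(|Z| > t) \le \exp(-t^2/2)$ for $t \ge 0$. The Python sketch below compares the exact two-sided tail with this bound. It assumes (C.15) and (C.16) have the forms $sP(|Z| > \sqrt{n}\,\psi_{1,n}\kappa_{\min}/\sigma)$ and $\exp\{-n\psi_{1,n}^2\kappa_{\min}^2/(2\sigma^2) + \log s\}$; the function names and the numerical inputs are hypothetical.

```python
import math


def two_sided_normal_tail(t):
    """Exact P(|Z| > t) for a standard normal Z, via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))


def chernoff_tail(t):
    """The bound exp(-t^2 / 2) on P(|Z| > t), valid for t >= 0."""
    return math.exp(-0.5 * t * t)


def union_bound_c15(n, s, psi, kappa_min, sigma):
    """s * P(|Z| > sqrt(n) * psi * kappa_min / sigma), the assumed form of (C.15)."""
    t = math.sqrt(n) * psi * kappa_min / sigma
    return s * two_sided_normal_tail(t)


def bound_c16(n, s, psi, kappa_min, sigma):
    """exp{-n psi^2 kappa_min^2 / (2 sigma^2) + log s}, the assumed form of (C.16)."""
    return math.exp(-n * psi ** 2 * kappa_min ** 2 / (2.0 * sigma ** 2) + math.log(s))


# Hypothetical inputs: n = 100 observations, s = 5 true covariates.
lhs = union_bound_c15(n=100, s=5, psi=0.5, kappa_min=1.0, sigma=1.0)
rhs = bound_c16(n=100, s=5, psi=0.5, kappa_min=1.0, sigma=1.0)
```

For these inputs `lhs` is no larger than `rhs`, and both shrink exponentially in $n$ when $\psi_{1,n}$ and $\kappa_{\min}$ stay bounded away from zero.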

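Sign consistency for estimated coefficients with indices in $S_1^c$. Now by
plugging (C.9) in the middle term of (C.4) and then taking absolute value on the
quantity, for $j \in S_1^c$, we have

$$
\begin{aligned}
\Big|\Big(X_{S_1^c}^TX_Sw_S - X_{S_1^c}^T\epsilon + \frac{\rho_2}{2}\sqrt{q_{S_1^c}}\,u_{S_1^c}\Big)_{\vartheta_j^{S_1^c}}\Big|
\le{}& n^{-1}\big|(X_{S_1^c}^TX_SC_{SS}^{-1}X_S^T\epsilon)_{\vartheta_j^{S_1^c}}\big| + \big|(X_{S_1^c}^T\epsilon)_{\vartheta_j^{S_1^c}}\big| \\
&+ \frac{\rho_2\sqrt{q_{k_j}}\,|v_j|}{2\tau\log(1+\tau^{-1})\,(1+\tau^{-1}\|w_{G_{k_j}}+\beta_{G_{k_j}}\|_2)} \\
&+ \frac{1}{2n\tau\log(1+\tau^{-1})}\big|(X_{S_1^c}^TX_SC_{SS}^{-1}B_{S,\tau})_{\vartheta_j^{S_1^c}}\big|.
\end{aligned}
\tag{C.17}
$$

By plugging the right hand side of (C.17) into the left hand side of (C.4) and doing
some rearrangements, we obtain the following inequality:

$$
n^{-1}\big|(X_{S_1^c}^TX_SC_{SS}^{-1}X_S^T\epsilon)_{\vartheta_j^{S_1^c}}\big| + \big|(X_{S_1^c}^T\epsilon)_{\vartheta_j^{S_1^c}}\big|
< \frac{1}{2\tau\log(1+\tau^{-1})}\left[\rho_1 - n^{-1}\big|(X_{S_1^c}^TX_SC_{SS}^{-1}B_{S,\tau})_{\vartheta_j^{S_1^c}}\big| - \frac{\rho_2\sqrt{q_{k_j}}\,|v_j|}{1+\tau^{-1}\|w_{G_{k_j}}+\beta_{G_{k_j}}\|_2}\right].
\tag{C.18}
$$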

Note that by Assumption 3, the maximum eigenvalue value of the matrix X^Xj is 
'^Wax- Therefore, 

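$$\big|(X_{S_1^c}^TX_SC_{SS}^{-1}X_S^T\epsilon)_{\vartheta_j^{S_1^c}}\big| \le \|X_{S_1^c}^TX_SC_{SS}^{-1}X_S^T\epsilon\|_\infty \le n\nu_{\max}\kappa_{\min}^{-1}\|X_{S_1^c}^T\epsilon\|_\infty. \tag{C.19}$$

Further define

$$(B_{S_1^c,\tau})_{\vartheta_j^{S_1^c}} = \frac{\rho_2\sqrt{q_{k_j}}\,v_j}{1+\tau^{-1}\|w_{G_{k_j}}+\beta_{G_{k_j}}\|_2}. \tag{C.20}$$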



With (C.19) and (C.20), we construct the following event:



< 



^min 
1 



1 1 1 |Xjce| |oo 



rlog(l + r-i) 



— ^pi — n ^WX^c^XsC glBs^rWoo — ||-Bs=,t||oo^ 



(C.21) 
Since the left hand side of the inequality stated in $E_2$ is larger than the left hand
side of (C.18), and the right hand side of the inequality stated in $E_2$ is smaller than
the right hand side of (C.18), if the inequality stated in $E_2$ holds, then
(C.18) will hold. In turn both (C.3) and (C.4) will hold, and the sign consistency
for estimated coefficients with indices in $S_1^c$ can be established.
Now define $\psi_{2,n}$ by



2,n 



rlogfl + r 



— ^^pi— n ^||XjcXs'C5_^-Bs',^||oo — ll-B, 



5f,r I 1 00 



(C.22) 



Then following a technique similar to the one used in deriving (C.15) and (C.16),
we can bound the probability of $E_2^c$ in a way such that



F{E!^) < exp - 



2, n mill 



(?max ~l~ '^^min) 



+ log sl 



exp < — n 



2 

min 



log si 



)2^2 



n 



(C.23) 



Sign consistency for estimated coefficients with indices in $G_{R^c}$. Now by
plugging (C.9) into the left hand side of (C.5), we have



1 



< 2 



nr log(l + T 



Pi 



T log(l + r 1 
1 



he. 



nr log(l + T' 



{1 + t-^\wg, + (3gA) 



(C.24) 



for $k \in R^c$. Further by plugging the right hand side of (C.24) into the left hand side
of (C.5) and doing some rearrangements, we can obtain the following inequality:



n 



^Gk^sCgs^st 



+ \\XcA\2 



< 



2rlog(l + r 
-Pi 



— \p2y/qk-n 

he. 



^\\^Gi.^sCssBs,r 



(C.25) 



By Assumption 3, the maximum eigenvalue of the matrix $X_{G_k}^TX_{G_k}$ is
$n\nu_{k,\max}$. Further note that

$$\|X_{G_k}^TX_SC_{SS}^{-1}X_S^T\epsilon\|_2 \le n\nu_{\max}\kappa_{\min}^{-1}\|X_{G_k}^T\epsilon\|_2 \le n\nu_{\max}\kappa_{\min}^{-1}\sqrt{n\nu_{k,\max}}\,\|\epsilon\|_2. \tag{C.26}$$

With (C.26), we construct the following event:



+ 1 jy/nu^ ,max "^ 2 



< 



1 



rlog(l + r-i) 



— max pi 

k€R'= 



P2 



he. 



(1 + r-^WG, + Pg, 



for all keR"" 



(C.27) 



Since the left hand side of the inequality stated in $E_3$ is larger than the left hand side
of (C.25), and the right hand side of the inequality stated in $E_3$ is smaller than the
right hand side of (C.25), if the inequality stated in $E_3$ holds, then (C.25)
will also hold. In turn, if (C.25) holds for all $k \in R^c$, then both (C.3) and (C.5) will
hold, and the sign consistency for estimated coefficients with indices in $G_{R^c}$ can be
established.

We follow a strategy similar to those given above to derive an estimate for the
tail probability of $E_3$. Define $\psi_{3,n}$ by



rlog(l + r-1) 



max pi 

keR" 



P2 



ho. 



(l + r->G, +/3gJ) 



(C.28) 



Note that $E_3$ is equivalent to the event
$\bigcap_{k\in R^c}\{4n\nu_{k,\max}\kappa_{\min}^{-2}(\nu_{\max}+\kappa_{\min})^2\|\epsilon\|_2^2 < \psi_{3,n}^2\}$.

Therefore the probability of $E_3^c$ can be bounded in a way such that



> 



)V2 



\e\\l 



> 



^minV's,?! 



4nz/, 



max V^max 



(C.29) 



In addition, since the $\epsilon_i$'s are i.i.d. normal variables with mean zero and variance
$\sigma^2$, $\|\epsilon\|_2^2/\sigma^2$ is a Chi-square variable with $n$ degrees of freedom.
It can be shown that $E[\exp(a\|\epsilon\|_2^2/\sigma^2)] = (1-2a)^{-n/2}$ for $a < 1/2$. We
let $a = 1/4$, then $E[\exp(4^{-1}\|\epsilon\|_2^2/\sigma^2)] = 2^{n/2}$. With the arguments
given above, the probability of $E_3^c$ can be further bounded in a way such that

-2 ,2 



< 



exp 



n 



+ - log 2 + log r' 



< exp < — n 



16?^ i^max('?max ~l~ ^min) 



-0.35- 



n 



(C.30) 
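
The Chi-square step can be illustrated numerically. The Python sketch below applies Markov's inequality to $\exp(a\chi^2_n)$ with $a = 1/4$, using the moment generating function quoted in the text, and compares the resulting bound with the exact tail for even $n$ (where a closed form exists). The function names are hypothetical and the sketch is an illustration of the technique, not a reproduction of (C.30).

```python
import math


def chi2_sf_even(t, n):
    """Exact P(chi^2_n > t) for even n: exp(-t/2) * sum_{i < n/2} (t/2)^i / i!."""
    if n % 2 != 0 or n <= 0:
        raise ValueError("closed form used here requires a positive even n")
    half = t / 2.0
    term, total = 1.0, 0.0
    for i in range(n // 2):
        total += term
        term *= half / (i + 1)
    return math.exp(-half) * total


def chi2_mgf(a, n):
    """E[exp(a * chi^2_n)] = (1 - 2a)^(-n/2), valid for a < 1/2."""
    if a >= 0.5:
        raise ValueError("the moment generating function is finite only for a < 1/2")
    return (1.0 - 2.0 * a) ** (-n / 2.0)


def chernoff_bound(t, n, a=0.25):
    """Markov's inequality on exp(a * chi^2_n): P(chi^2_n > t) <= mgf(a) * exp(-a t)."""
    return chi2_mgf(a, n) * math.exp(-a * t)


# With a = 1/4 the moment generating function equals 2^(n/2).
mgf_at_quarter = chi2_mgf(0.25, 10)

# Per-observation cost of that factor on the log scale: (log 2) / 2.
log_cost_per_n = math.log(2.0) / 2.0
```

The value of `log_cost_per_n` is about 0.347, which is consistent with the constant 0.35 that appears in the exponent of (C.30).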



Since $E_1$, $E_2$ and $E_3$ jointly imply conditions (C.3), (C.10), (C.4) and (C.5), which
further imply the sign consistency $\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta)$, we have

$$P\{\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta)\} \ge P(E_1 \cap E_2 \cap E_3) = 1 - P\{(E_1 \cap E_2 \cap E_3)^c\}.$$

Further note that
$P\{(E_1 \cap E_2 \cap E_3)^c\} = P(E_1^c \cup E_2^c \cup E_3^c) \le P(E_1^c) + P(E_2^c) + P(E_3^c)$.
Therefore we have

$$P\{\mathrm{sign}(\hat{\beta}) = \mathrm{sign}(\beta)\} \ge 1 - P(E_1^c) - P(E_2^c) - P(E_3^c). \tag{C.31}$$

Then by applying the tail probability results (C.16), (C.23) and (C.30) to construct
a lower bound for the quantity on the right hand side of (C.31), we recover the
inequality (5.10).

Asymptotic behavior of $\psi_{1,n}$, $\psi_{2,n}$ and $\psi_{3,n}$. Now we go on to show that as
$n \to \infty$, $\psi_{1,n}$, $\psi_{2,n}$ and $\psi_{3,n}$, defined in (C.14), (C.22) and (C.28),
respectively, can satisfy the requirements stated in Theorem 5.4. We first consider the asymptotic
behavior of {Bs^t)^s, which is defined in flU.Sp . Note that by assumptions, if \wj + 

l3j\ 7^ 0, then hj = 1 or hj = —1. Therefore given that r = and pi = 0{n^^'^), 
the first term on the right hand side of (IC.SP will be 0{n^^^'^). In addition, if 
\wj + Pj\ =0, then hj is an arbitrary quantity in [—1, 1]. In this situation we may let 
hj be proportional to n~^, then the first term on the right hand side of (IC.Sp will be 
0(n~^/^). An argument similar to the one given above can be applied to the second 
term on the right hand side of flC.81) . Further note that given A = 0(^1/2), the third 
term on the right hand side of flC.81) will be 0(ra~^/^ log(l + ra)). With the arguments 
given above, we conclude that 



S,t)'SS 



0{n-^'^ log(l + n)) 



(C.32) 



for all j G S. An argument similar to the one given above can be applied to {Bs^, 
in flC.20p and the term pi\\hG^{^ + T^^\wG^, + /^g^ l)""*^! I2 in -E3, which leads to 



for all i G SI and 



Pi 



he. 



0{q, 



l + r-i|wG, + A 



(C.33) 



(C.34) 



for all k^R". 

Next we go on to deal with the $\ell_\infty$-norm terms involved in $\psi_{1,n}$, $\psi_{2,n}$
and $\psi_{3,n}$. First note that for a $p$ dimensional vector $b$, we can bound
$\|b\|_\infty$ in a way such that



, = A/maXj < ■\jY^j=\ ^"j — v^^- Therefore for ipi^n defined in (IC.14p . we 
can bound the term | |C^^i?5^T-| |oo in a way such that 



l^ssBs,- 



< 



O 



log(n + 1) 

71/ / ^min 



Now consider 1^2,11 defined in flC.221) . First note that since C Grc 
;erm n~^\\XgcXs 

^\\XjoXsCssBs,, 



can bound the term n iXjcA^C^^-B^^^I |oo in a way such that 



n 



< 
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n 



maxn 
fceft 



T 

Gr' 

-111 



00 



^Gk^sCssBsAloo- 



(C.35) 
therefore we 

(C.36) 
The right hand side of (C.36) can be further bounded in a way such that



k(.Rc II SS i.rlloo _ ^^^^ V S,T 



"mm 



■5 ^ y^i^Vnax^max losl-'- 



O ^ ^ . (C.37) 



A similar argument can be applied to the term
$\max_{k\in R^c} n^{-1}\|X_{G_k}^TX_SC_{SS}^{-1}B_{S,\tau}\|_2$ in
$\psi_{3,n}$ defined in (C.28), which leads to



S"-'l|A-S.A-,C,-iB.,.|b = 0(^^^^^^^P|^^^). (C.38) 

Note that we have assumed $p = o(n(\log(n+1))^{-2})$, and since $s \le p$ and $q_k \le p$
for $k = 1, 2, \cdots, m$, we have $s^{1/2} = o(n^{1/2}(\log(n+1))^{-1})$ and
$q_k^{1/2} = o(n^{1/2}(\log(1+n))^{-1})$ for $k = 1, 2, \cdots, m$. Then with (C.35), the
second term on the right hand side of (C.14) will approach zero as $n \to \infty$, therefore
we have $\psi_{1,n} = O(1)$ as $n \to \infty$. In addition, with results in (C.33), (C.36)
and (C.37), the second and third terms on the right hand side of (C.22) will approach zero



as n — )■ 00, therefore we have ip2,n = 0{n^^'^{\ogn)~^) as n — )■ 00. Moreover, with 
results in (1C.34P and f lC.38p . the second and third terms on the right hand side of 
(1C.28P will approach to zero as n — i- 00, therefore we have ip^^n = 0(n^/^(logn)^^) 
as n — i- 00, which completes the proof. 
Table 1: Out-of-sample mean squared error. Each value is an average over the 415
time blocks. The value in the bracket is the standard error.

Method       Model 1         Model 2
gvsnss       16.99 (1.83)    16.67 (1.81)
lasso        21.87 (1.74)    22.48 (1.81)
gvsnss-PC    17.50 (1.90)    17.03 (1.86)
lasso-PC     17.66 (1.83)    18.39 (1.88)
PC           16.75 (1.88)    17.61 (1.92)
AR                           18.68 (2.03)



Table 2: Estimation results based on 100 sub-sampling simulations. Each value is
an average over 100 sub-sampling simulations and the value in the bracket is the
standard error. PMSE: Predictive mean squared error; $\hat{s}$: The number of covariates
with non-zero estimated coefficients; $\hat{s}^{+}_{SSE}$: The number of covariates with
positive estimated coefficients in the Susan Shepard Effect group.

                       gvsnss 5CV     gvsnss BF      lasso 10CV
PMSE_test              0.43 (0.01)    0.38 (0.01)    0.41 (0.01)
$\hat{s}$              2.81 (0.25)    1.03 (0.02)    3.89 (0.22)
$\hat{s}^{+}_{SSE}$    0.35 (0.13)    0.00 (0.00)    1.33 (0.09)
Figure 1: Left: The index function and its log approximations; Right: The mean
absolute difference between the index function and its log approximation as a function
of $-\log\tau$. Each point is an average over absolute differences with input values from
[-10, 10].
[Simulation figure panels, titled by their spr and mis-labeled values. Legend: GVSNSS-5CV,
GVSNSS-BF, glasso-AIC, glasso-5CV, lasso-10CV.]


Figure 2: Estimation results from simulated data. Each point is an average over 100
replicates. For all data sets, we set mis-labeled = 0, p = 200, m = 10 and r = 2.
Left: spr = 0; Center: spr = 0.3; Right: spr = 0.6. Top: SFPR; Middle: $\ell_2$ distance
between the estimates and the true values; Bottom: Logarithm of the PMSE with
respect to base 10.

Figure 3: Estimation results from simulated data. Each point is an average over 100
replicates. For all data sets, we set mis-labeled = 0.1, p = 200, m = 10 and r = 2.
Left: spr = 0; Center: spr = 0.3; Right: spr = 0.6. Top: SFPR; Middle: $\ell_2$ distance
between the estimates and the true values; Bottom: Logarithm of the PMSE with
respect to base 10.

Figure 4: Estimation results from simulated data. Each point is an average over 100
replicates. For all data sets, we set mis-labeled = 0.5, p = 200, m = 10 and r = 2.
Left: spr = 0; Center: spr = 0.3; Right: spr = 0.6. Top: SFPR; Middle: $\ell_2$ distance
between the estimates and the true values; Bottom: Logarithm of the PMSE with
respect to base 10.

Figure 5: Top Left: Percentage change of the U.S. industrial production index. The
change is defined as $100[\log(IP_t) - \log(IP_{t-12})]$. Top Right: The number of selected
variables for the 415 time blocks. Bottom Left: Frequencies of variables being
selected under the gvsnss. Bottom Right: Frequencies of variables being selected under
the lasso. OI: output and income; LM: labor market; H: housing; COI: consumption,
orders and inventories; MC: money and credits; BE: bond and exchange rates; P:
prices; SM: stock market.

Figure 6: Left: The out-of-sample squared error of Model 1 for the 415 time blocks.
Right: The out-of-sample squared error of Model 2 for the 415 time blocks.



Figure 7: Estimation results from the retirement plan data. Left: The gvsnss
estimation with five-fold cross validation. Middle: The gvsnss estimation with the Bayes
factor. Right: The lasso estimation with ten-fold cross validation. In each panel the
covariates are grouped into "Other covariates" and "Susan Shepard Effect".